[57] ABSTRACT

A computer software architecture for controlling a highly
parallel computer system comprises several layers of 
abstraction. The first layer is an abstract physical machine 
which contains a set of abstract physical processors. This 
layer may be considered as a microkernel. The next layer
includes virtual machines and virtual processors. A virtual 
machine comprises a virtual address space and a set of 
virtual processors that are connected in a virtual topology. 
Virtual machines are mapped onto abstract physical 
machines with each virtual processor mapped onto an 
abstract physical processor. The third layer of abstraction 
defines threads. Threads are lightweight processes that run 
on virtual processors. In a preferred embodiment the abstract 
physical machines, abstract physical processors, virtual 
machines, virtual processors, thread groups, and threads are 
all first-class objects.

21 Claims, 9 Drawing Sheets 
FIG. 6

(define (filter op n input)
  (let loop ((x (hd input))
             (output (make-stream))
             (last? true))
    (cond ((zero? (mod x n))
           (loop (rest input) output last?))
          (last?
           (op (lambda () (filter op x output)))
           (loop (rest input) (attach x output) false))
          (else (loop (rest input) (attach x output) last?)))))

(define (sieve op n)
  (let ((input (make-integer-stream n)))
    (op (lambda () (filter op 2 input)))))



FIG. 7 

(define (create-3D-mesh depth)
  (let* ((pm-width (get-pm-width))
         (pm-height (get-pm-height))
         (3D-mesh (make-array w h depth)))
    (let -*- ((i 0))
      (if (< i pm-width)
          (let -*- ((j 0))
            (if (< j pm-height)
                (let ((vp (create-vp (get-pp i j))))
                  (set-vp-address vp (vector i j))
                  (let -*- ((k 0))
                    (cond ((< k depth)
                           (set-aref! 3D-mesh vp)
                           (-*- (+ k 1)))))
                  (-*- (+ j 1)))))))))
FIG. 8

(define (start-context-switch next-state)
  (disable-preemption)
  (let* ((vp (current-vp))
         (next (let ((obj (tpm-get-next-thread vp)))
                 (if (false? obj)
                     (if (tcb-state-ready? next-state)
                         (current-tcb)
                         (tpm-vp-idle vp))
                     obj))))
    (cond ((eq? next (current-tcb)))
          ((tcb? next)
           (set-tcb.state next tcb-state/running)
           (set-thread.vp (tcb.thread next) vp)
           (cond ((eq? next-state tcb-state/dead)
                  (return-current-tcb-to-vp-pool vp)
                  (restore-tcb-and-registers next))
                 (else
                  (save-current-tcb-registers)
                  (restore-tcb-and-registers next))))
          ((thread? next)
           (thread-mutex-acquire next)
           (let ((tcb (allocate-tcb next-state vp)))
             (setup-new-thread next vp tcb)
             (if (neq? tcb (current-tcb))
                 (save-current-tcb-registers))
             (thread-mutex-release next)
             (start-new-tcb tcb start-new-thread))))))



FIG. 9 

(define (finish-context-switch)
  (let* ((vp (current-vp))
         (previous (vp.current-tcb vp))
         (state (tcb.state previous)))
    (set-vp.current-tcb vp (current-tcb))
    (cond ((neq? previous (current-tcb))
           (cond ((tcb-state-ready? state)
                  (if (neq? previous (vp.root-tcb vp))
                      (tpm-enqueue-ready-thread vp previous)))
                 ((tcb-state-blocked? state)
                  (thread-mutex-release (tcb.thread previous)))
                 ((tcb-state-suspended? state)
                  (tpm-enqueue-suspended-thread vp (tcb.thread previous))
                  (thread-mutex-release (tcb.thread previous))))))
    (enable-preemption)))
FIG. 10

(define (start-new-thread)
  (let ((z (error-value "Thread has no value")))
    (unwind-protect
      (set z (catch exit
               (set-tcb.exit-handler (current-tcb) exit)
               ((thread.thunk (current-thread)))))
      (let ((thread (current-thread)))
        (thread-gc thread)
        (set-thread.value thread z)
        (wakeup-waiters thread)
        (set-tcb.state (current-tcb) tcb-state/dead)
        (start-context-switch tcb-state/dead)))))



FIG. 11 



(define (pbisort root spare up?)
  (cond ((node-left root)
         (let ((left-half (future (pbisort (node-left root) root up?))))
           (pbisort (node-right root) spare (not up?))
           (touch left-half)
           (pbimerge root spare up?)))
        (else (compare-and-swap root spare up?))))

(define (pbimerge root spare up?)
  (let loop ((root root) (spare spare))
    (if (compare-and-swap root spare up?)
        (fixup-tree-1 root up?)
        (fixup-tree-2 root up?))
    (cond ((node-left root)
           (let ((left-half (future (loop (node-left root) root))))
             (loop (node-right root) spare)
             (touch left-half))))))
FIG. 12

(define (block-on-set count threads)
  (thread-mutex-acquire (current-thread))
  (let loop ((threads threads)
             (cnt count))
    (cond ((fx-zero? cnt)
           (thread-mutex-release (current-thread)))
          ((null? threads)
           (set-tcb.wait-count (current-tcb) cnt)
           (tcb-state->blocked (current-thread)))
          (else (let ((thread (car threads)))
                  (thread-mutex-acquire thread)
                  (cond ((thread-determined? thread)
                         (thread-mutex-release thread)
                         (loop (cdr threads) (fx- cnt 1)))
                        (else (let ((tb (make-tb)))
                                (set-tb.waiter tb (current-thread))
                                (set-tb.thread tb thread)
                                (set-tb.next tb (thread.waiters thread))
                                (set-thread.waiters thread tb))
                              (thread-mutex-release thread)
                              (loop (cdr threads) cnt)))))))))
SOFTWARE ARCHITECTURE FOR CONTROL OF HIGHLY PARALLEL
COMPUTER SYSTEMS

FIELD OF INVENTION

The present invention relates to a computer software
architecture designed to serve as a highly efficient substrate
for modern programming languages. The software architecture
controls highly parallel computer systems.

The present computer software architecture relies upon an
operating system which separates control issues from policy
issues. This separation occurs at two different abstraction
levels in the system: in the Abstract Physical Processor and
in the Virtual Processor. Each of these abstractions is
separated into two components, the "controller" which
implements the control portion of the abstraction and the
"policy manager" which makes policy decisions for the
controller. Separating control from policy permits definitions
of different behaviors for what is functionally the same
system by modifying only the policy manager portion of the
abstraction.

Specifically, the software architecture supports lightweight
threads of control and virtual processors as first-class
objects. Concurrency management is implemented in terms
of first class procedures and continuations, thereby
permitting optimization of the runtime behavior of the
application without requiring the user to have knowledge of
the underlying runtime system.

More specifically, the invention concerns a design for
building asynchronous concurrency structures, an
implementation of a thread controller using continuations as
its basic control mechanism, an organization of large-scale
concurrent computations and robust programming
environments for parallel computing.

BACKGROUND OF THE INVENTION

The growing interest in parallel computing has led to the
creation of a number of parallel programming languages that
define explicit high-level program and data structures for
expressing concurrency. Parallel languages targeted for
non-numerical application domains typically support (with
varying degrees of efficiency) concurrency structures that
realize dynamic lightweight process creation, high-level
synchronization primitives, distributed data structures, and
speculative concurrency. In effect, all these parallel
languages may be viewed as comprising two sublanguages: a
coordination language responsible for managing and
synchronizing the activities of a collection of processes, and
a computation language responsible for manipulating data
objects local to a given process.

Traditionally, there have been several classes of operating
system including: real time, interactive, and batch. These
three classes have provided different interfaces to the user
and thus porting a program from one class of operating
system (OS) to another has been difficult. Additionally, since
the scheduling decisions made by each class are different, it
is difficult to debug a program for one, e.g. a real time
application, on another, e.g. an interactive development
system, and have confidence that the application will run
correctly and efficiently on the target system.

The situation is complicated further by the number of
different scheduling regimes used in each of these classes of
systems. For example, some real time systems use a fixed
scheduling order for processes, some use a priority
discipline, some use a running quantum discipline, and still
others use a combination of these. Interactive operating
systems or batch operating systems have at least as many
scheduling alternatives if not more.

Separating control from policy allows the building of one
operating system that can be easily customized for various
classes of operating systems. In the present invention the
modules implementing policy managers typically are very
small relative to the size of the system, comprising in
general less than one hundred lines of code. Thus, usually it
is only necessary to write a small piece of code to build a
new system with different policy behavior. Additionally,
since the policy manager presents a well defined interface,
only the policy manager need be tested and not the entire
system.

Hydra (described in the book "Hydra/C.mmp: An
Experimental Computer System" by William Wulf, Roy
Levin and Samuel Harbison, McGraw-Hill, 1991) was
designed with the separation of control and policy in mind,
but Hydra allowed policy customization only at the kernel
level. The present invention goes further by allowing the
programmer to customize policy decisions as they relate to a
particular program. Thus an interactive program such as an
editor or window manager can have very different policies
from a compute intensive program such as a fluid dynamics
simulation or finite elements computation. Separation of
control from policy in Hydra is also expensive since it
required several context switches between the kernel and the
policy manager. In the present invention policy managers are
generally linked directly into the appropriate address space
and require no context switching. They are thus at least as
efficient as traditional (uncustomizable) operating system
policy managers and are usually more efficient.

One way of implementing a high-level parallel language is
to build a dedicated (user-level) virtual machine. The
virtual machine serves primarily as a substrate that
implements the high-level concurrency primitives found in
the coordination sublanguage. Given a coordination language
L supporting concurrency primitive P, the role of L's virtual
machine (L_P) is to handle all implementation aspects
related to P; this often requires that the machine manage
process scheduling, storage, exception handling and the like.
However, because L_P is tailored only towards efficient
implementation of P, it is often unsuitable for implementing
significantly different concurrency primitives. Thus, to build
a dialect of L with concurrency primitive P' usually requires
either building a new virtual machine or expressing the
semantics of P' using P. Both approaches have obvious
drawbacks: the first is costly to implement given the
complexity of implementing a new virtual machine; the
second is inefficient given the high-level semantics of P and
L_P's restricted functionality.

Rather than building a dedicated virtual machine for
implementing concurrency, a language implementation may
use low-level operating system services. Process creation
and scheduling is implemented by creating heavy- or
lightweight OS-managed threads of control; synchronization
is handled using low-level OS-managed structures. These
implementations tend to be more portable and extensible
than systems built around a dedicated runtime system, but
they necessarily sacrifice efficiency since every (low-level)
kernel call requires a context switch between the application
and the operating system. Moreover, generic OS facilities
perform little or no optimization at either compile time or
runtime since they are usually insensitive to the semantics of
the concurrency operators of interest.

SUMMARY OF THE INVENTION

The present invention concerns the implementation of a
coordination substrate that permits the expression of a wide
range of concurrency structures within the context of a high
level programming language. The invention defines a
general-purpose coordination model on top of which a
number of specialized coordination languages can be
efficiently implemented. In practicing the invention, the
software Scheme as described by Jonathan Rees and William
Clinger, editors, in "The Revised Report on the Algorithmic
Language Scheme" in ACM Sigplan Notices, 21(12), 1986,
was used as the computation base. Scheme is a higher-order
lexically scoped dialect of Lisp. While Scheme is the
preferred language, it should be apparent to those skilled in
the art that the design of the substrate could be incorporated
into any modern (high-level) programming language.

The operating system is designed to run primarily on 
MIMD (multiple instruction — multiple data) parallel 
computers, with either shared or disjoint memory, as well as 
distributed machines comprising networks of workstations. 
The software architecture uses a shared virtual memory 
model when executing on disjoint memory or distributed 
memory machines. It has been used to implement many 
different algorithms corresponding to different paradigms of 
parallelism including result parallelism, master/slave
parallelism and speculative parallelism. Several different parallel
programming models have been implemented on top of the 
operating system including futures, first class tuple spaces 
and engines. 

The dialect of Scheme comprising the operating system
which is a feature of a preferred embodiment of the present
invention (called Sting) includes a coordination language
(implemented via a dedicated virtual machine) for expressing
asynchronous lightweight concurrency that combines
the best of both approaches. In contrast to other parallel
Scheme systems and parallel dialects of similar high-level
languages, the basic concurrency objects in Sting (threads,
virtual processors, and physical processors) are streamlined
data structures with no complex synchronization
requirements. Unlike parallel systems that rely on OS services
for managing concurrency, Sting implements all concurrency
management issues in terms of Scheme objects and
procedures, permitting users to optimize the runtime
behavior of applications without requiring knowledge of
underlying OS services. Sting supports the features essential
to creating and managing various forms of asynchronous
parallelism with a conceptually unified and very general
framework.

Results have shown that it is possible to build an efficient
substrate upon which various parallel dialects of high-level
languages can be built. Sting is not intended merely to be a
vehicle that implements stand-alone short-lived programs; it
is anticipated that this system will provide a framework for
building a rich programming environment for parallel
computing. In this regard, the system provides support for
thread preemption, per-thread asynchronous garbage-collection,
exception handling across thread boundaries, and
application dependent scheduling policies. In addition, it
contains the necessary functionality to handle persistent
long-lived objects, multiple address spaces and other features
commonly associated with advanced programming
environments.

In Sting, virtual processors are multiplexed on abstract 
physical processors and threads are multiplexed on virtual 
processors. All policy decisions relating to this multiplexing 
are decided by policy managers. Decisions relating to the 
multiplexing of virtual processors on physical processors are 
made by the Virtual Processor Policy Manager (VPPM) 
while decisions relating to the multiplexing of threads on 
virtual processors are made by the Thread Policy Manager 
(TPM). 
Policy managers make three types of decisions: how to
map a new object (VP or thread) onto a processor (physical
or virtual) when the object is created or resumed, what order
to run the objects mapped on a particular processor, and
when to remap or move an object from one processor to
another.
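
The separation between a controller and its policy manager can be sketched in a few lines of Scheme. This is a minimal illustration under stated assumptions, not Sting's actual interface: the names make-policy, fifo-policy, lifo-policy and run-all are hypothetical, the "processor" is a single sequential loop, and only the second decision type (the order in which mapped objects run) is modelled.

```scheme
;; Hypothetical sketch: none of these names are Sting's actual API.
;; A "controller" runs thunks on one simulated processor; the policy
;; manager alone decides the order in which they run.

;; A policy manager is modelled as a pair of procedures:
;;   enqueue : (queue item) -> queue
;;   dequeue : (queue)      -> (item . remaining-queue)
(define (make-policy enqueue dequeue) (cons enqueue dequeue))
(define (policy-enqueue p) (car p))
(define (policy-dequeue p) (cdr p))

;; FIFO policy: new work goes to the back, the front runs next.
(define fifo-policy
  (make-policy
   (lambda (queue item) (append queue (list item)))
   (lambda (queue) (cons (car queue) (cdr queue)))))

;; LIFO policy: new work goes to the front, the front runs next.
(define lifo-policy
  (make-policy
   (lambda (queue item) (cons item queue))
   (lambda (queue) (cons (car queue) (cdr queue)))))

;; The controller is identical for every policy. It hands each new
;; thunk to the policy, then runs thunks in the order the policy
;; dictates and returns the results in execution order.
(define (run-all policy thunks)
  (let fill ((pending thunks) (queue '()))
    (if (pair? pending)
        (fill (cdr pending)
              ((policy-enqueue policy) queue (car pending)))
        (let drain ((queue queue) (results '()))
          (if (null? queue)
              (reverse results)
              (let ((next ((policy-dequeue policy) queue)))
                (drain (cdr next)
                       (cons ((car next)) results))))))))

(define jobs
  (list (lambda () 'a) (lambda () 'b) (lambda () 'c)))
```

With these definitions, (run-all fifo-policy jobs) yields (a b c) and (run-all lifo-policy jobs) yields (c b a). The controller code is unchanged between the two runs, which is the property the text attributes to separating control from policy.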

Sting is an operating system designed to support modern
programming languages such as Scheme, SmallTalk, ML,
Modula3, or Haskell. It provides a foundation of low level,
orthogonal constructs that allows the language designer or
implementor to build the various constructs required by
these languages easily and efficiently.

Modern programming languages have more extensive
requirements than traditional programming languages such
as Cobol, Fortran, C or Pascal. Even though Sting is
designed to support modern programming languages, it
accommodates traditional programming languages just as
efficiently. The list below identifies some of the elements
that distinguish modern from traditional languages.

Parallelism: The growing availability of general purpose
multi-processors has led to increased interest in building
efficient and expressive platforms for concurrent
programming. Most efforts to incorporate concurrency into
high-level programming languages involve the addition of
special purpose primitives to the language.

Multiple Synchronization Models: There are many
synchronization protocols used in parallel or asynchronous
programming. A modern operating environment should as
far as possible provide the primitives to support the various
protocols.

Lazy and Eager Evaluation: Many modern languages
support either lazy or eager evaluation or both. It is
important for the operating system to provide the full range
of evaluation strategies from lazy to eager.
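
The two ends of that range can be shown with standard Scheme alone. The following is a small sketch, assuming only the standard delay and force forms; the procedure names and the evaluation counter are illustrative and are not part of Sting.

```scheme
;; Hypothetical sketch using only standard Scheme (delay / force).
(define evaluations 0)

;; An expensive computation that records how often it runs.
(define (costly-square x)
  (set! evaluations (+ evaluations 1))
  (* x x))

;; Eager: the value is computed as soon as the binding is made.
(define eager-value (costly-square 6))

;; Lazy: `delay` packages the computation; nothing runs yet.
(define lazy-value (delay (costly-square 7)))

;; `force` runs the delayed computation on first demand and caches
;; the result, so forcing twice evaluates the body only once.
(define (demand-twice promise)
  (+ (force promise) (force promise)))
```

After these definitions are loaded, evaluations is 1, because only the eager binding has run. Calling (demand-twice lazy-value) returns 98 and raises evaluations to 2, showing that the delayed computation ran exactly once, and only when its value was demanded.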

Automatic Storage Management: This has become a
fundamental feature of many modern languages, because
automatic storage management allows more expressive
programs, while at the same time reducing both the number
of errors in and the complexity of programs.

Topology Mapping: While not yet supported in many
programming languages, the ability to control the mapping
of processes to processors so as to reduce the communication
overhead of a program will become more important as
the size of multi-processor computer systems continues to
grow and the topologies become more complex.
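
How a mapping choice affects communication overhead can be illustrated with a small sketch. It assumes a 2D mesh of virtual nodes placed on a smaller 2D mesh of physical processors; block-map, cyclic-map and cross-links are hypothetical names introduced here, and counting horizontally adjacent node pairs that land on different processors is only a crude stand-in for communication cost.

```scheme
;; Hypothetical sketch: maps virtual mesh coordinates onto a smaller
;; physical mesh. Names and mapping rules are illustrative only.

;; Block mapping: a virtual mesh of vw x vh nodes is cut into pw x ph
;; rectangular tiles, one tile per physical processor, so that
;; neighbouring virtual nodes usually share a physical processor.
;; Assumes vw is a multiple of pw and vh is a multiple of ph.
(define (block-map vw vh pw ph)
  (let ((tile-w (quotient vw pw))
        (tile-h (quotient vh ph)))
    (lambda (i j)
      (cons (quotient i tile-w) (quotient j tile-h)))))

;; Cyclic mapping: virtual nodes are dealt out round-robin, so
;; neighbouring virtual nodes land on different physical processors.
(define (cyclic-map pw ph)
  (lambda (i j)
    (cons (modulo i pw) (modulo j ph))))

;; Count the horizontally adjacent virtual node pairs (i,j)-(i+1,j)
;; that a mapping places on different physical processors; each such
;; pair stands for communication that crosses processors.
(define (cross-links mapping vw vh)
  (let loop ((i 0) (j 0) (count 0))
    (cond ((= j vh) count)
          ((>= (+ i 1) vw) (loop 0 (+ j 1) count))
          ((equal? (mapping i j) (mapping (+ i 1) j))
           (loop (+ i 1) j count))
          (else (loop (+ i 1) j (+ count 1))))))
```

For a 4 x 4 virtual mesh on a 2 x 2 physical mesh, (cross-links (block-map 4 4 2 2) 4 4) gives 4, while (cross-links (cyclic-map 2 2) 4 4) gives 12. Under this simple measure the block mapping keeps more neighbours on the same processor, which is the kind of control over placement the paragraph describes.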

Sting supports these various elements efficiently. It does
so in an architectural framework that is more general and
more efficient than those currently available. It also provides
the programmer with an increased level of expressiveness
and control and an unparalleled level of customizability.

Four features of the Sting design, when taken as a whole,
best distinguish the software architecture from other parallel
languages:

1. The Concurrency Abstraction: Concurrency is 
expressed in Sting via a lightweight thread of control. 
A thread is a non-strict first-class data structure. 

2. The Processor and Policy Abstractions: Threads
execute on a virtual processor (VP) that represents an
abstraction of a scheduling and load-balancing protocol.
There may be many more virtual processors than
the actual physical processors available. Like threads,
virtual processors are also first-class objects. A VP
contains a thread policy manager that determines the
scheduling and migration regime for the threads that it
executes. Different VPs can, in fact, contain different
thread policy managers without incurring any performance
penalty. Virtual processors execute on physical
processors which are abstractions of actual physical
computing devices.

A collection of virtual processors and an address space
combine to form a virtual machine. Multiple virtual
machines can execute on a single physical machine; physical
machines comprise a set of physical processors. Virtual and
physical machines are also denotable Scheme objects and
may be manipulated as such.

3. Storage Model: A thread allocates data on a stack and
heap that it manages exclusively. Thus, threads garbage
collect their private state independently of one another;
no global synchronization is necessary in order for a
thread to initiate a private garbage collection. Data may
be referenced across threads. Inter-area reference
information is used to garbage collect objects across thread
boundaries. Storage is managed via a generational
scavenging collector; long-lived or persistent data
allocated by a thread is accessible to other threads in the
same virtual machine.

The design is sensitive to storage locality concerns; for
example, storage for running threads is cached on VPs and
is recycled for immediate reuse when a thread terminates.
Moreover, multiple threads may share the same execution
context whenever data dependencies warrant.

4. The Program Model: Sting permits exceptions to be
handled across threads, supports non-blocking I/O,
permits the customized scheduling of virtual processors
in the same way that the scheduling of threads on
virtual processors is customizable, and provides an
infrastructure for implementing multiple address
spaces and shared persistent objects. Sting also
supports efficient message-passing communication via the
use of first class polymorphic ports. Ports serve to
alleviate overheads in the implementation of shared
memory on disjoint memory platforms.

The present invention of a software architecture for
controlling a highly parallel computer system integrates an
operating system (Sting), a base language and a compiler
into an abstract machine. The starting point is a high level
programming language, such as Scheme. The programming
language is augmented with efficient abstractions including
threads, virtual processors and policy managers. This novel
operating system includes mechanisms that take advantage
of current architectural trends that place a premium on data
locality.

The result is a mechanism for building efficient
coordination structures for parallel computing. The use of
lightweight threads provides a foundation for advanced
programming environments. Support for data locality results
in an efficient asynchronous system.

Central to the system performance is the concept of a
virtual topology. A virtual topology defines a relation over a
collection of virtual processors; processor topologies
configured as trees, graphs, hypercubes and meshes are some
of the well-known examples. A virtual processor is an
abstraction that defines scheduling, migration and load
balancing policies for the threads it executes. The virtual
topologies are intended to provide a simple and expressive
high level framework for defining complex thread/processor
mappings that abstracts low-level details of a physical
interconnection.

Threads created by a computation are mapped to processors
in the virtual topology via mapping functions associated
with the topology. These mapping functions are user
definable. Given an implementation of a system using virtual
topologies on a particular multiprocessor platform, it is
possible to define procedures that map virtual processors in
the virtual topology to physical processors in the platform.

The code itself is machine independent insofar as it does
not contain any reference to physical processors or their
interconnection. All concerns related to thread mapping and
locality are abstracted in the specification of the virtual
topology used by the program and the manner in which
nodes in the topology are traversed during program
execution.

The benefit of virtual topologies and processor mappings
is not only efficiency but also portability so that
implementations of parallel algorithms need not be
specialized for different physical topologies. Since the
mapping algorithm used to associate threads with processors
is specified as part of a virtual topology, programmers have
fine control over how threads should be mapped to virtual
processors. If the communication requirements of the threads
generated by a computation are known in advance, the
ability to explicitly allocate these threads onto specific
virtual processors can lead to better load balancing than can
an implicit mapping strategy. The structure of a control and
dataflow graph defined by a parallel algorithm can be
exploited in a number of ways. If a collection of threads
share common data, it is possible to construct a topology
that maps the virtual processors on which these threads
execute to the same physical processor. Virtual processors
are multiplexed on physical processors in the same way
threads are multiplexed on virtual processors. If a collection
of threads have significant communication requirements, it is
possible to construct a topology that maps threads that
communicate with one another onto virtual processors close
together in the virtual topology. If thread T1 has a data
dependency with the value yielded by another thread T2, it
is reasonable to map T1 and T2 on the same virtual
processor. In fine grained programs where processors are
busy most of the time, the ability to schedule data-dependent
threads on the same or a nearby processor leads to
opportunities for improved thread granularity. Finally,
certain algorithms have a process structure that unfolds as
the computation progresses, such as adaptive tree algorithms.
These algorithms are best executed on topologies that allow
dynamic creation of virtual processors.

Other novel aspects of the software architecture include
the role of continuations and first class procedures in the
implementation of an efficient and general-purpose
multi-threaded operating system and programming
environment.

Continuations are used to implement state transition
operations, exception handling and important storage
optimizations. A continuation is an abstraction of a program
point. It is represented as a procedure of one argument that
defines the remaining computation needed to be performed
from the program point it denotes.

A Sting virtual address space is composed of a set of
areas. Areas are used for organizing data that exhibit strong
temporal or spatial locality. Sting supports a variety of areas:
thread control blocks, stacks, thread private heaps, thread
shared heaps etc. Data are allocated to areas based on their
intended use and lifetime and thus, different areas can have
different garbage collectors associated with them.

Exceptions and interrupts are always handled in the
execution context of some thread, as is the case with a
thread-level context-switch. Exception handlers are
implemented as ordinary Scheme procedures and dispatching
an exception primarily involves manipulating continuations.

Insofar as Sting is a programming system that permits the
creation and management of lightweight threads of control,
it shares several common traits with thread package systems
developed for other high-level languages. These systems
also view threads as a manifest datatype, support preemption
in varying degrees, and in certain restricted cases, permit
programmers to specify a specialized scheduling regimen.
The thread abstraction defines the coordination sublanguage
in these systems.

There are some important differences, however, that
clearly distinguish Sting from these other systems. First, the
scheduling and migration protocol Sting uses is completely
customizable; different applications can run different
schedulers without modifying the thread manager or the
virtual processor abstraction; such customization can be
applied to the organization of the virtual machine itself.
Second, Sting's support for data locality, storage
optimization, and process throttling via thread absorption is
absent in other systems. Moreover, all thread operations are
implemented directly within the Sting virtual machine: there
is no context switch to a lower level kernel that must be
performed in order to execute a thread operation. Sting is
built on an abstract machine intended to support long-lived
applications, persistent objects, and multiple address spaces.
Thread packages provide none of this functionality since (by
definition) they do not define a complete program
environment.

Since Sting is designed as a systems programming
language, it provides low-level concurrency abstractions:
application libraries can directly create thread objects, and
can define their own scheduling and thread migration
strategies. High-level concurrency constructs are realizable
using threads, but the system does not prohibit users from
directly using thread operations in the ways described
above, if efficiency considerations warrant. In particular, the
same application may define concurrency abstractions with
different semantics and efficiency concerns within the same
runtime environment.

In certain respects, Sting resembles other advanced
multi-threaded operating systems: for example, it supports
non-blocking I/O calls with call-back, user control over
interrupts, and local address space management as user-level
operations. It separates user-level and kernel-level concerns:
physical processors handle (privileged) system operations
and operations across virtual machines; virtual processors
implement all user-level thread and local address-space
functionality. However, because Sting is an extended dialect
of Scheme, it provides the functionality and expressivity of
a high-level programming language that typical operating
system environments do not offer.

Sting is a platform for building asynchronous programming
primitives and experimenting with new parallel programming
paradigms. In addition, the design also allows
different concurrency models to be evaluated competitively.
Scheme offers an especially rich environment in which to
undertake such experiments because of its well-defined
semantics, its overall simplicity, and its efficiency. However,
the Sting design itself is language independent; and thus, it
could be incorporated fairly easily into any high-level
programming language.

Sting does not merely provide hooks for each concurrency
paradigm and primitive considered interesting. Instead the
focus is on basic structures and functionality common to a
broad range of parallel programming structures; thus, the
implementation of blocking is easily used to support
speculative computation. The thread absorption optimization
used to throttle the execution of threads is well-suited for
implementing futures and tuple-space synchronization, and,
finally, customizable policy managers make it possible to [...]

A principal object of the present invention is, therefore, the
provision of a computer operating system architecture for
controlling a highly parallel multiprocessor/multicomputer
system to serve as a highly efficient substrate for modern
programming languages.

A further object of the present invention is the provision
of a software architecture for asynchronous computation
based upon customizable virtual machines.

Another object of the present invention is the provision of
a software architecture which supports lightweight threads
on virtual processors as first-class objects.

A still further object of the present invention is the
provision of a software architecture which includes
customizable policy management, particularly at the user
level.

Yet another object of the present invention is the
provision of a software architecture which includes
customizable virtual topologies.

An object of the present invention is the provision of a
software architecture which includes thread absorption,
delayed TCB allocator and thread groups as a locus of
storage sharing.

An object of the present invention is the provision of a
software architecture system which includes polymorphic
ports.

Other and still further objects of the present invention will
be best understood when the following description is read in
conjunction with the accompanying drawing.

BRIEF DESCRIPTION OF THE DRAWING

FIG. 1 is a schematic block diagram representation of the
software architecture comprising the present invention;

FIG. 2 is a representation of an abstract physical machine;

FIG. 3 is a schematic block diagram representation of the
abstract architecture of the operating system;

FIG. 4 is a thread state and TCB-state transition diagram;

FIG. 5 is a schematic representation of a storage
organization;

FIG. 6 is an illustrative program demonstrating programming
with threads;

FIG. 7 is an illustrative program for creating a 3D mesh
of virtual processors multiplexed on a 2D mesh of physical
processors;

FIG. 8 is a program for initiating a context switch;

FIG. 9 is a program for finishing a context switch;

FIG. 10 is a program for starting a new thread;

FIG. 11 is a program of the top-level procedures for an
adaptive parallel sorting algorithm; and

FIG. 12 is a block-on-set program.

DETAILED DESCRIPTION

Referring now to the figures and to FIG. 1 in particular,
there is shown a schematic block diagram representation of
the software architecture comprising the present invention.
An Abstract Physical Machine (PM) 10 comprises a set of
Abstract Physical Processors (PP) 12 connected to each
other in a Physical Topology (PT) 11. The Abstract Physical
Machine is used to execute a set of Virtual Machines (VM)
14. Each Virtual Machine, in turn, comprises one or more
Virtual Processors (VP) 16 connected in a Virtual Topology
(VT) 20, 20'. Threads (T) 18 execute on one or more of the
virtual processors in the same virtual machine. Moreover, a

build fair and efficient schedulers for a variety of other particular thread may migrate between different virtual 

paradigms. processors in the same virtual machine 14. A thread policy 
manager (TPM) 19 (shown in FIGS. 2 and 3) controls the thread scheduling
and thread load balancing policy. The interactions among and the details
of the different elements are described below.

The software architecture (also sometimes referred to as the operating
system architecture) can be considered as an arrangement comprising
several layers of abstraction (FIG. 2). The first layer includes the
abstract physical machine 10, which contains a set of abstract physical
processors 12. This layer corresponds to what is referred to as the
micro-kernel in state-of-the-art operating systems. The next layer
includes virtual machines 14 and virtual processors 16. A virtual
machine comprises a virtual address space and a set of virtual
processors that are connected in a virtual topology. Virtual machines
are mapped onto abstract physical machines, with each virtual processor
mapped onto an abstract physical processor. The third layer of
abstraction defines threads 18. The threads are lightweight processes
that run on virtual processors.

For example, a virtual topology might represent a tree of virtual
processors that is mapped onto physical processors that are physically
connected in a mesh topology. Virtual topologies allow the programmer to
express a program in a (virtual) topology that is suitable to the
algorithm being implemented. Sting provides an efficient mapping from a
virtual topology onto the actual physical topology of the target
machine. Virtual topologies also allow parallel programs to be portable
across different physical topologies.

The main components in the coordination sublanguage of Sting are
lightweight threads of control and the virtual processors. Threads are
simple data structures that encapsulate local storage (i.e., registers,
stacks and heaps), code, and relevant state information (i.e., status,
priorities, preemption bits, locks, etc.). They define a separate locus
of control. The system imposes no constraints on the code encapsulated
by a thread: any valid Scheme expression can be treated as a distinct
process.

Referring to FIGS. 2 and 3, each virtual processor (VP) 16 includes a
Thread Controller (TC) 17 which implements a state transition function
on threads and a Thread Policy Manager (TPM) 19 which implements both a
thread scheduling and load-balancing/migration policy. While within the
same virtual machine each VP shares the thread controller, different VPs
may contain different thread policy managers.

Virtual processors 16 are multiplexed on physical processors 12 in the
same way that threads 18 are multiplexed on virtual processors. Each
physical processor corresponds to a computing engine in a multiprocessor
environment. Associated with each physical processor PP is a virtual
processor controller 13 and a virtual processor policy manager 15. The
virtual processor controller will context switch among virtual
processors because of preemption, or because it is explicitly requested
to do so. The virtual processor policy manager handles scheduling
decisions for virtual processors 16 that execute on a physical processor
PP. For example, a virtual processor may relinquish control of its
physical processor if no threads are executing on it, and none can be
migrated from any other VP. Physical processors may run virtual
processors of any virtual machine found in the system.

Virtual machines encapsulate a single address space 24 accessible
exclusively by its associated virtual processors and threads. Virtual
machines may share global information (e.g., libraries, file systems,
etc.) located in global storage pool 26 and are responsible for mapping
global objects 28 (i.e., objects resident in a global address space)
into their local address space. Virtual machines also contain the root
of a live object graph (i.e., the root environment 30) that is used to
trace all live objects in its address space.

All Sting objects (including threads, VPs, and virtual machines) reside
in a persistent memory. The memory is organized in terms of a collection
of disjoint areas. Objects are garbage collected within an area using a
generational collector. An object can reference any other object found
in its address space. Objects initially reside in short-lived
thread-local areas. Objects that survive garbage collection percolate
upwards in the generation hierarchy. This functionality is completely
transparent to the user.

A first-class object is an object that can be passed as an argument to a
procedure, returned as a result from a procedure or stored in data
structures. In a preferred embodiment of the present invention abstract
physical machines, abstract physical processors, virtual machines,
virtual processors, thread groups and threads are all first-class
objects. In an alternative embodiment, only threads and virtual
processors are first-class objects.

The Sting compiler is a modified version of Orbit. Orbit is described in
an article by D. Kranz, et al., entitled "Orbit: An Optimizing Compiler
for Scheme" in ACM SIGPLAN Notices, 21(7): 219-233, July 1986. The
target machine seen by the compiler includes a dedicated thread register
to hold a reference to the currently running thread object. Moreover,
time critical operations such as those that save and restore registers
on a context switch, or allocate thread storage (i.e., stack and heaps)
are provided as primitive operations. Sequential Scheme programs will
compile and execute without modification. Sting implementations of
futures, distributed data structures, and speculative concurrency
operations also exist; Scheme programs can be freely augmented with the
concurrency operations supported by any of these paradigms.

Threads are first-class objects in Sting. Thus, they may be passed as
arguments to procedures, returned as results, and stored in data
structures. Threads can outlive the objects that create them. A thread's
state contains a thunk, a nullary procedure that is invoked when the
thread executes. The value of the application is stored in the thread on
completion. For example, evaluating the expression:

(fork-thread expr vp)

creates a lightweight thread of control that invokes the thunk
(lambda () expr). The evaluation environment of this thunk is the
lexical environment of the fork-thread expression.

A thread records status information as part of its state (see FIGS. 4
and 5). A thread can be either delayed 36, scheduled 38, evaluating 40,
absorbed 42 or determined 44. A delayed thread will never be run unless
the value of the thread is explicitly demanded. A scheduled thread is a
thread that is known to some VP, but which has not yet been allocated
storage resources. An evaluating thread is a thread that has started
running. A thread remains in this state until the application of its
thunk yields a result, at which point its state becomes determined.
Absorbed threads are an important specialization of evaluating threads
and are discussed in more detail below.

In addition to state information and the code to be evaluated, a thread
also contains (1) references to other threads waiting for it to
complete, (2) the thunk's dynamic and exception environment, and (3)
genealogy information including its parent, siblings, and children.

Each thread also has dynamic and exception environments that are used to
implement fluid (or dynamic)
bindings and exception handling. Genealogy information serves
as a useful debugging and profiling tool that allows appli- 
cations to monitor the dynamic unfolding of a process tree. 

The implementation of threads requires no alteration of 
the other primitive operations in the language. The
synchronization semantics of a thread is a more general (albeit
lower-level) form of the synchronization facility available
via, e.g., MultiLisp's "touch", Linda's tuple-space, or CML's
"sync".

The application completely controls the condition under 
which blocked threads may be resumed. However, there is 
explicit system support for dataflow (i.e., future-touch),
non-deterministic choice, and constraint-based or barrier 
synchronization. 

Users manipulate threads via a set of procedures (listed 
below) defined by a thread controller (TC) which implements
synchronous state transitions on a thread's state. The
TC preferably is written entirely in Scheme with the excep- 
tion of two primitive operations to save and restore registers. 
The thread controller allocates no storage; thus, a TC call 
never triggers a garbage collection. In addition to these 
operations, a thread can enter the controller because of 
preemption. Thread procedures include: 
(fork-thread expr vp) creates a thread to evaluate expr, and
    schedules it to run on vp.
(delay-thread expr) creates a delayed thread that when
    demanded (via thread-value) evaluates expr.
(thread-run thread vp) inserts a delayed, blocked or
    suspended thread into vp's ready queue.
(thread-wait thread) causes the thread executing this
    operation to block until thread's state becomes determined.
(thread-block thread . blocker) requests thread to block;
    blocker is the condition on which the thread is blocking.
(thread-suspend thread . quantum) requests thread to
    suspend execution. If the quantum argument is provided, the
    thread is resumed when the period specified has elapsed;
    otherwise, the thread is suspended indefinitely until it is
    explicitly resumed using thread-run.
(thread-terminate thread . values) requests thread to
    terminate with values as its result.
(yield-processor) causes the current thread to relinquish
    control of its VP. The thread is inserted into a suitable
    ready queue.
(current-thread) returns the thread executing this operation.
(current-virtual-processor) returns the virtual processor on
    which this operation is evaluated.
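
The thread states behind these procedures can be pictured with a small model. The following Python sketch is illustrative only and is not part of Sting: the names `ToyThread`, `ToyVP`, and `step` are hypothetical, and only the delayed, scheduled, evaluating, and determined states are modeled. It shows a forked thread running when its VP gets to it, and a delayed thread running only when its value is demanded.

```python
from enum import Enum, auto


class State(Enum):
    DELAYED = auto()
    SCHEDULED = auto()
    EVALUATING = auto()
    DETERMINED = auto()


class ToyThread:
    """Hypothetical model of a thread object: a thunk plus a state."""

    def __init__(self, thunk, state):
        self.thunk = thunk
        self.state = state
        self.value = None

    def run(self):
        # delayed/scheduled -> evaluating -> determined
        self.state = State.EVALUATING
        self.value = self.thunk()
        self.state = State.DETERMINED


class ToyVP:
    """Hypothetical virtual processor with a FIFO ready queue."""

    def __init__(self):
        self.ready = []

    def step(self):
        # Run the oldest ready thread, skipping any already forced.
        if self.ready:
            t = self.ready.pop(0)
            if t.state is not State.DETERMINED:
                t.run()


def fork_thread(thunk, vp):
    t = ToyThread(thunk, State.SCHEDULED)
    vp.ready.append(t)
    return t


def delay_thread(thunk):
    return ToyThread(thunk, State.DELAYED)


def thread_value(t):
    # Demanding the value forces a thread that has not yet run.
    if t.state is not State.DETERMINED:
        t.run()
    return t.value
```

In this model a delayed thread stays delayed no matter how often the VP steps; only `thread_value` moves it to determined.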

To illustrate how users might program with threads, 
consider the program shown in FIG. 6 which defines a 
simple prime finder implementation. The definition makes 
no reference to any particular concurrency paradigm; such 
issues are abstracted by its op argument.

This implementation relies on a user-defined synchroniz- 
ing thread abstraction that provides a blocking operation on 
stream access (hd) and an atomic operation for appending to
the end of a stream (attach).

Various implementations of a prime number finder may be 
defined that exhibit different degrees of asynchronous 
behavior. For example, 



(let ((filter-list (list)))
  (sieve (lambda (thunk)
           (set! filter-list (cons (delay-thread (thunk))
                                   filter-list)))
         n))



defines an implementation in which filters are generated 
lazily; once demanded, a filter repeatedly removes elements 
off its input stream, and generates potential primes onto its 
output stream. To initiate a new filter scheduled on a VP
using a round-robin thread placement discipline, it is pos- 
sible to write: 

(thread-run (car filter-list)
            (mod (1+ (vm.vp-vector (vp.vm (current-virtual-processor))))
                 n))



The expression (vp.vm (current-virtual-processor)) defines the
virtual machine of which the current VP is a part. A virtual
machine's public state includes a vector containing its 
virtual processors. 
By slightly rewriting the above call to sieve, we can
express a more lazy implementation:



(let ((filter-list (list)))
  (sieve (lambda (thunk)
           (set! filter-list
                 (cons (create-thread
                         (begin
                           (map thread-run filter-list)
                           (thunk)))
                       filter-list))
           (map thread-block filter-list))
         n))



In this definition, a filter that encounters a potential prime p, 
creates a lazy thread object L and requests all other filters in 
the chain to block. When L's value is demanded, it unblocks
all the elements in the chain, and proceeds to filter all
multiples of p on its input stream. This implementation
throttles the extension of the sieve and the consumption of 
input based on demand. 
It is also possible to define an eager version of the sieve
as follows:



(sieve (lambda (thunk)
         (fork-thread (thunk) (current-vp)))
       n)

Evaluating this application schedules a new thread responsible
for filtering all multiples of a prime; this thread is
scheduled on the virtual processor executing this operation.

In this call, an evaluating thread is generated whenever a
new prime is encountered.
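
The way the op argument separates the sieve from its concurrency policy can be mimicked outside Sting. The Python sketch below is an analogy under stated assumptions, not a translation of FIG. 6: blocking `queue.Queue` objects stand in for the synchronizing streams, the `DONE` marker is an invented end-of-stream convention, and `lazy_primes` and `eager_primes` are hypothetical names. The sieve code is identical in both cases; only the `op` passed in differs.

```python
import queue
import threading

DONE = object()  # hypothetical end-of-stream marker, not part of FIG. 6


def filter_stage(op, n, inp, primes):
    """Drop multiples of n from inp and pass survivors to the next stage."""
    out = None
    while True:
        x = inp.get()                  # blocking read, in the spirit of hd
        if x is DONE:
            if out is not None:
                out.put(DONE)
            return
        if x % n == 0:
            continue
        if out is None:                # first survivor is the next prime
            out = queue.Queue()
            primes.append(x)
            # op alone decides when and where the new filter runs
            op(lambda x=x, out=out: filter_stage(op, x, out, primes))
        else:
            out.put(x)                 # append, in the spirit of attach


def sieve(op, n):
    inp = queue.Queue()
    for i in range(2, n + 1):
        inp.put(i)
    inp.put(DONE)
    primes = [2]
    op(lambda: filter_stage(op, 2, inp, primes))
    return primes


def lazy_primes(n):
    pending = []                       # filters are recorded, not run
    primes = sieve(pending.append, n)
    while pending:
        pending.pop(0)()               # demand the oldest pending filter
    return primes


def eager_primes(n):
    threads = []

    def op(thunk):
        t = threading.Thread(target=thunk)
        threads.append(t)
        t.start()

    primes = sieve(op, n)
    i = 0
    while i < len(threads):            # later stages append more threads
        threads[i].join()
        i += 1
    return primes
```

Both drivers assume n is at least 2. The lazy driver runs each filter only when it is taken off the pending list, while the eager driver starts an operating-system thread per filter as soon as a new prime is found.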

Sting treats thread operations as ordinary procedures, and 
manipulates the objects referenced by them just as any other 
Scheme object; if two filters attached via a common stream
are terminated, the storage occupied by the stream may be
reclaimed. Sting imposes no a priori synchronization protocol
on thread access; application programs are expected
to build abstractions that regulate the coordination of 
threads. 

The threads created by filter may be terminated in one of
two ways. The top-level call to sieve may be structured so 
that it has an explicit handle on these threads; the filter-list 
data structure used to create a lazy sieve is such an example. 
One can then evaluate: 

(map thread-terminate filter-list)

to terminate all threads found in the sieve. Alternatively, 
applications may use thread groups to collectively manage 
these threads. 
Thread Groups

Sting provides thread groups as a means of gaining 
control over a related collection of threads. A thread group 
is created by a call to fork-thread-group; this operation
creates a new group and a new thread that becomes the root 
thread of that group. A child thread shares the same group as 
its parent unless it explicitly creates a new group. A thread 
group includes a shared heap accessible to all its members.
When a thread group terminates, all live threads in the group
are terminated and its shared heap is garbage collected.

A thread group also contains debugging and thread opera- 
tions that may be applied en masse to all of its members. 
Thread groups provide operations analogous to ordinary 
thread operations, (e.g., termination, suspension, etc.) as 
well as operations for debugging and monitoring (e.g., 
listing all threads in a given group, listing all groups, 
profiling, genealogy information, etc.). Thus, when thread T
is terminated, users can request that all of T's children (which
are defined to be part of T's group) be terminated as well.



Thread groups are an important tool for controlling sharing
in a hierarchical memory architecture. Since objects
shared by members in a group are contained in the group's
shared heap, they are preferably physically close to one
another in memory, and thus better locality is exhibited.
Thread groups can also be used as a locus for scheduling.
For example, a thread policy manager might implement a
scheduling policy in which no thread in a group is allowed
to run unless all threads in the group are allowed to run; this
scheduling regime is similar to a "gang scheduling" protocol.
Thread groups can be used in conjunction with virtual
topologies to improve data locality.
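
A minimal sketch of the group rules described in this section may help fix ideas. It is written in Python purely for illustration; `ToyGroup`, `GroupedThread`, `fork_child`, and `terminate_group` are hypothetical names and do not correspond to Sting procedures. The sketch assumes only what the text states: a child joins its parent's group unless it creates a new one, and terminating a group terminates its live members and discards its shared heap.

```python
class ToyGroup:
    """Hypothetical thread group: members plus a shared heap."""

    def __init__(self):
        self.members = []
        self.shared_heap = {}


class GroupedThread:
    def __init__(self, group):
        self.group = group
        self.alive = True
        group.members.append(self)

    def fork_child(self, new_group=False):
        # A child shares its parent's group unless it asks for a new one.
        return GroupedThread(ToyGroup() if new_group else self.group)


def fork_thread_group():
    # Creates a new group and the root thread of that group.
    return GroupedThread(ToyGroup())


def terminate_group(group):
    # All live members are terminated and the shared heap is released.
    for t in group.members:
        t.alive = False
    group.shared_heap.clear()
```

A thread that started its own group is unaffected when its parent's group is terminated, which is the property that makes groups useful for collective control.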

Execution Contexts and Thread Control Blocks

When a thread begins evaluating, an execution context is 
allocated for it. Every evaluating thread is associated with an 
execution context that is also known as a Thread Control 
Block (TCB) 32 (FIG. 5). A TCB is a generalized representation
of a continuation and includes its own stack 31 and
local heap 33. Both stacks and heaps can be chained, and
heaps are garbage collected using a generational scavenging
collector. Besides storage objects, a TCB includes an associated
lock, values of all live registers extant at the time the
thread last performed a context switch, the thread's sub-state
(e.g., initialized, ready, evaluating, blocked, suspended,
etc.), the VP on which the thread was last executing, thread
priority, and time quantum. 

The thread state and sub-state transition diagram is shown
in FIG. 4. TCB states reflect the operations allowed on
evaluating threads. If evaluating thread T has TCB T_TCB, the
state field of T_TCB indicates one of the following:
initialized 46: The stack and heap associated with T_TCB have
    been initialized, but no code has yet been executed.
ready 48: T can execute on any available VP, but is not
    currently executing on any.
running 50: T is currently executing on some VP.
blocked 52: T is currently blocked on some thread or
    condition.
suspended 54: T is suspended for some potentially infinite
    duration.
terminated 56: T has finished executing and is cleaning up
    residual state.

Unlike threads, TCBs are not first-class user-visible
objects; they are accessible only to thread controllers and
thread policy managers. When a new thread is ready to run,
a TCB is allocated for it; when a thread becomes determined,
its TCB is available for use by the thread controller for
threads created subsequently. TCBs never escape into user
maintained data structures; they are manipulated exclusively
by system-level procedures.

The Sting implementation defers the allocation of storage 
for a thread until necessary. In other thread packages, the act 
of creating a thread involves not merely setting up the 
environment for the thread to be forked, but also allocating 
and initializing storage. This approach lowers efficiency in
two important respects. First, in the presence of fine-grained
parallelism, the thread controller may spend more time
creating and initializing threads than actually running them.
Second, since stacks and process control blocks are immediately
allocated upon thread creation, context switches
among threads often cannot take advantage of cache and
page locality. In addition, if TCB allocation is not delayed,
the total memory requirements of the system can be significantly
increased.

Thread control blocks in Sting are recyclable resources 
that are managed by virtual processors. A TCB is allocated 
for a thread only when the thread begins evaluation. The 
allocation strategy is designed to improve data locality. A 
TCB may be allocated to a thread T that is to run on VP V 
in one of four ways: 

1. If the thread currently executing on V has just
terminated, its context is available for immediate
re-allocation. Its TCB is the best candidate for allocation
because it has the most locality relative to its VP.
The physical caches and memory associated with this
VP are most likely to contain the execution context of
the thread most recently running on that VP.

2. If the currently executing thread has not terminated, a
TCB for T is allocated from a LIFO pool of TCBs
maintained on V. Here again, the execution context is
the one with the most locality.

3. If V's pool is empty, a new TCB is allocated from a
global pool that is also organized in LIFO order. Every
local VP pool maintains a threshold τ of the number of
TCBs it may hold. When a pool overflows, its VP
moves half the TCBs in the local pool to the global
pool; when the local pool underflows, τ/2 TCBs are
moved from the global pool to the VP-local one. Global
pools serve two roles: (1) to minimize the impact of
program behavior on TCB allocation and reuse, and (2)
to ensure a fair distribution of TCBs to all virtual
processors.

4. Finally, if no TCB is available either in the global or
local pool, a new set of τ/2 TCBs is dynamically
created and allocated to T. Since new TCBs are created
only if both the global and the VP local pool are empty,
the number of TCBs actually created during the evaluation
of a Sting program is determined collectively by
all VPs.
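
The four allocation cases above can be sketched as a two-level pool. The Python below is a simplified model, not the Sting implementation: the class names, the string identifiers used for TCBs, and two details are assumptions made for the example (a freshly created batch is placed in the global pool before being moved to the VP, and on overflow the least recently used half of the local pool is the half that moves).

```python
class GlobalPool:
    """Hypothetical global TCB pool, used in LIFO order."""

    def __init__(self):
        self.free = []
        self.created = 0


class VPPool:
    """Hypothetical VP-local TCB pool with threshold tau."""

    def __init__(self, global_pool, tau):
        self.free = []
        self.global_pool = global_pool
        self.tau = tau

    def allocate(self, just_terminated=None):
        # Case 1: reuse the context of the thread that just terminated.
        if just_terminated is not None:
            return just_terminated
        # Case 2: take the most recently released local TCB.
        if self.free:
            return self.free.pop()
        g = self.global_pool
        half = max(1, self.tau // 2)
        # Case 4: nothing anywhere, so create a batch of tau/2 TCBs.
        if not g.free:
            for _ in range(half):
                g.created += 1
                g.free.append("tcb%d" % g.created)
        # Case 3: local underflow, move up to tau/2 TCBs from the global pool.
        for _ in range(min(half, len(g.free))):
            self.free.append(g.free.pop())
        return self.free.pop()

    def release(self, tcb):
        self.free.append(tcb)
        if len(self.free) > self.tau:
            # Overflow: move half of the local pool (oldest first) out.
            for _ in range(len(self.free) // 2):
                self.global_pool.free.append(self.free.pop(0))
```

Because a released TCB is the next one handed out, a VP that runs a stream of short threads keeps reusing the same few contexts, which is the locality argument the text makes.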

Virtual Processors

Virtual processors (and by extension, virtual machines) 
are first-class objects in Sting. According first-class status to 
VPs has several important implications that distinguish 
Sting from both high-level thread systems and other asyn- 
chronous parallel languages. First, one can organize parallel 
computations by explicitly mapping processes onto specific 
virtual processors. For example, a thread P known to com- 
municate closely with another thread Q that is executing on 
VP V should execute on a VP topologically near V. Such
considerations can be expressed in Sting since VPs can be
directly denoted. Systolic style programs, for example, can be
expressed by using self-relative addressing off the current
VP (e.g., current-VP, left-VP, right-VP, up-VP, etc.). The
system provides a number of default addressing modes for
many common topologies (e.g., hypercubes, meshes, systolic
arrays, etc.). Furthermore, since VPs can be mapped
onto specific physical processors, the ability to manipulate
virtual processors as first-class data values gives Sting
programmers a great deal of flexibility in expressing different
parallel algorithms that are defined in terms of specific
processor topologies.

To illustrate, consider the program shown in FIG. 7 which
creates a 3D mesh of virtual processors multiplexed on a two
dimensional mesh of physical processors. The array has
height and width equal to the height and width of the
physical machine. The mapping collapses the three dimensional
array onto the two dimensional array by mapping
every element in the depth dimension to the same virtual
processor. Thus, the number of virtual processors created is
the same as the number of physical processors. All threads
mapped onto a processor at the same depth execute on the
same VP. The procedures get-pm-height and get-pm-width
are provided by the physical machine interface. Absolute
addressing of virtual processors is simply an array reference
into the array returned by create-3D-mesh.

The create-vp procedure creates a new VP running on the
physical processor returned by get-pp. Having created a
topology, it is possible to build self-relative addressing
procedures off the current VP; for example, it is possible to
define an up-VP procedure that moves up one dimension in
the topology thus:

(define (up-VP)
  (let ((address (vp-address (current-virtual-processor))))
    (array-ref 3D-mesh (vector-ref address 0))))



1. Create a set of virtual processors which are mapped
onto an appropriate physical processor.

2. Associate an address in the virtual topology with each
VP.

3. Store the virtual processor in a data structure used for
absolute addressing in the virtual topology, and define
appropriate access routines on that structure.

4. Define procedures for self-relative addressing. 
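
The four steps can be followed in a short model of the 3D-on-2D mapping. This Python sketch is an illustration rather than a rendering of FIG. 7: the `VP` class, the dictionary used for absolute addressing, and `right_vp` are hypothetical, and the wrap-around at the mesh edge is an assumption made so the example is total.

```python
class VP:
    """Hypothetical virtual processor record."""

    def __init__(self, pp, address):
        self.pp = pp              # coordinates of the physical processor
        self.address = address    # address in the virtual topology


def create_3d_mesh(width, height, depth):
    mesh = {}
    for i in range(width):
        for j in range(height):
            # Steps 1 and 2: one VP per physical processor, with an address.
            vp = VP(pp=(i, j), address=(i, j))
            for k in range(depth):
                # Step 3: every depth index of column (i, j) shares that VP.
                mesh[(i, j, k)] = vp
    return mesh


def right_vp(mesh, vp, width):
    # Step 4: self-relative addressing off a given VP (wraps at the edge).
    i, j = vp.address
    return mesh[((i + 1) % width, j, 0)]
```

With a 2 by 3 physical machine and depth 4, the mesh has 24 addressable positions but only 6 distinct virtual processors, matching the statement that the number of VPs equals the number of physical processors.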
Thread Controller 

The thread controller handles the virtual processor's interaction
with other system components such as physical
processors and threads. The most important function of the
thread controller is to handle the state transitions of threads.
Whenever a thread makes a state transition that results in it
yielding the virtual processor on which it is currently
running, the thread controller calls the thread policy manager
to determine which thread to run next.

The implementation of the Sting thread controller high- 
lights a number of interesting issues. The central state 
transition procedures are shown in FIGS. 9 and 10. Operations
on TCBs found in these procedures are not available to
user applications. Since the thread controller is written in
Sting, all synchronous calls to TC procedures are treated as
ordinary procedure calls; thus, live registers used by the
procedure running in the current thread are saved automatically
in the thread's TCB upon entry into the controller.

The start-context-switch procedure (FIG. 8) takes the
desired next state for the current thread (i.e., the thread
which has entered the TC) as its argument.

Preemption is first disabled. A new thread (or TCB) is then
returned by the procedure tpm-get-next-thread.

If there are no runnable threads, the procedure returns
false. In this case, the current thread is re-run (assuming it
is in a ready state), or the procedure tpm-vp-idle is called
with the current VP as its argument. The procedure tpm-vp-idle
may perform various bookkeeping operations or it may
request its physical processor to switch to another VP.

If the object bound to next is the current TCB, no action 
is taken, and the current thread resumes immediately. If the 
object returned is another TCB, its state is set to running,
its VP field is set to the current VP, the current TCB is either
recycled in the TCB pool (if its state is dead), or its registers
are saved, and the state of the new TCB is restored into 
processor registers. 

If the object returned is a thread that has no execution 
context, a TCB is allocated for it. This TCB may be the
current TCB if the next-state field is dead, or a TCB
allocated from the VP-local or global pool. The thread starts
execution using the primitive procedure start-new-tcb; it
applies the procedure start-new-thread (see FIG. 10) using
the new TCB as its execution context.

The finish-context-switch code (FIG. 9) is executed by the 
thread returned by start-context-switch; its purpose is to 
release locks held by the switched-out thread (called previ- 
ous in the procedure), to set the VP field of the new thread, 
enqueue previous onto a ready queue if appropriate, and 
reestablish the preemption timer. By enqueuing previous
only after a new thread is established on a VP, the controller
eliminates any race condition between effecting a state
transition and enqueuing a thread onto a VP's ready queue.
The procedures tpm-enqueue-ready-thread and
tpm-enqueue-suspended-thread are implemented by the thread
policy manager.

The code for start-new-thread is shown in FIG. 10. A
thread object with thunk E can begin evaluation once a TCB
is allocated for it, and it becomes associated with a default
error handler and appropriate cleanup code. Throws to exit
(the catch point established by start-new-thread) from E
cause the thread stack to be unwound properly, thereby
permitting resources such as locks held by the thread to be
properly released. The exit code following the evaluation of
E garbage collects the thread stack and heap, stores the
value yielded by E as part of the thread state, wakes up all
threads waiting for this value, and makes a tail recursive call
to the state transition procedure to choose a new thread to
run. Because E is wrapped within a dynamic-wind form, it
is guaranteed that thread storage will be garbage collected
even if a thread terminates abnormally.

Garbage collection must take place before the thread's
waiters are awakened because objects that outlive the thread
(including the object returned by the thread's thunk) and
are contained in its local heap must be migrated to another
heap; failing to do so would allow other threads to obtain
references to the newly terminated thread's storage. This
would clearly be erroneous since a determined thread's
storage may be allocated to other threads.
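
The ordering constraint described here (clean up storage first, then wake the waiters, and do both even on abnormal exit) can be shown with a small sketch. The Python below is an analogy only: `try`/`finally` stands in for the role the text assigns to the dynamic-wind form, the dictionary layout of a thread is invented for the example, and the log entries merely record the order of events.

```python
def make_thread(thunk, waiters):
    """Hypothetical thread record used only for this illustration."""
    return {
        "thunk": thunk,
        "waiters": list(waiters),
        "state": "evaluating",
        "value": None,
    }


def start_new_thread(thread, log):
    """Run the thunk and always clean up, in the order the text gives."""
    try:
        thread["value"] = thread["thunk"]()
    except Exception as exc:
        # Stand-in for the default error handler: record the failure.
        thread["value"] = exc
    finally:
        # Runs on normal and abnormal exit alike.
        log.append("collect storage")
        thread["state"] = "determined"
        for waiter in thread["waiters"]:
            log.append("wake " + waiter)
```

In both the normal and the failing case the log shows storage collection ahead of any wake-up, so no waiter can observe the terminated thread's local storage.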
Thread Policy Manager 

Each virtual processor contains a thread policy manager. 
The thread policy manager makes all policy decisions relat- 
ing to the scheduling and migrations of threads on virtual 
processors. The thread controller is a client of the thread 
policy manager and it is inaccessible to user code. The 
thread controller calls the thread policy manager whenever 
it needs to make a decision concerning: the initial mapping 
of a thread to a virtual processor; which thread a virtual 
processor should run next when the current thread releases 
the virtual processor for some reason; or when and which 
threads to migrate to/from a virtual processor. 
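
The division of labor between controller and policy manager can be illustrated with two interchangeable policies behind one interface. This Python sketch is hypothetical: the method names loosely mirror tpm-enqueue-ready-thread and tpm-get-next-thread, the `priority` parameter and both policy classes are assumptions for the example, and `run_all` is only a stand-in for the thread controller.

```python
import heapq
from collections import deque


class FifoTPM:
    """Hypothetical policy manager: first come, first served."""

    def __init__(self):
        self.ready = deque()

    def enqueue_ready_thread(self, thread, priority=0):
        self.ready.append(thread)

    def get_next_thread(self):
        return self.ready.popleft() if self.ready else None


class PriorityTPM:
    """Hypothetical policy manager: highest priority first."""

    def __init__(self):
        self.ready = []
        self.count = 0    # tie-breaker keeps FIFO order among equals

    def enqueue_ready_thread(self, thread, priority=0):
        heapq.heappush(self.ready, (-priority, self.count, thread))
        self.count += 1

    def get_next_thread(self):
        return heapq.heappop(self.ready)[2] if self.ready else None


def run_all(tpm):
    """Stand-in for the controller: it only asks the TPM what runs next."""
    order = []
    while True:
        t = tpm.get_next_thread()
        if t is None:
            return order
        order.append(t)
```

The controller loop is unchanged when the policy is swapped, and the queue each policy keeps is never touched from outside it.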

While all virtual processors have the same thread
controller, each virtual processor may have a different policy
manager. This ability is particularly important for real time
applications where each processor may be controlling a
different subsystem with different scheduling requirements.

The thread policy manager presents a well-defined inter-
face to the thread controller. The data structures that the
thread policy managers use to make their decisions are
completely private to them. They may be local to a particular
thread policy manager or shared among the various thread
policy managers, or some combination thereof, but they are
never available to any other part of the system. The thread
policy manager can thus be customized to provide different
behaviors for different virtual machines. This allows the user
to customize policy decisions depending on the type of
program being run.

Since VPs can contain different thread policy managers,
different groups of threads created by an application may be
subject to different scheduling regimes. Virtual machines or
virtual processors can be tailored to handle different sched-
uling protocols or policies.

The Sting thread controller defines a thread state transi-
tion procedure, but does not define a priori scheduling or
load-balancing policies. These policies can be application
dependent. Although several default policies are provided as
part of the overall Sting runtime environment, users are free
to write their own policies. In fact, referring to FIG. 3, each
virtual processor 16 contains its own thread policy manager
(TPM) 19; thus, different VPs in a given virtual machine
may implement different policies. The TPM 19 handles
thread scheduling, processor/thread mapping, and thread
migration.

The ability to partition an application into distinct sched-
uling groups is important for long-lived parallel (or
interactive) programs. Threads executing I/O bound proce-
dures have different scheduling requirements from those
executing compute bound routines; applications with real-
time constraints should be implemented using different
scheduling protocols than those that require only a simple
FIFO scheduling policy.

Tree-structured parallel programs may perform best using
a LIFO-based scheduler; applications running master/slave
or worker farm algorithms may do better using a round-
robin preemptive scheduler for fairness. Since all of these
applications may be components of a larger program struc-
ture or environment, the flexibility afforded by having them
evaluate with different policy managers is significant. Dis-
tinct applications can exist as independently executing col-
lections of threads evaluating on the same virtual machine.
Moreover, each distinct scheduler can have a thread policy
manager with different performance characteristics and
implementation concerns.

The present invention seeks to provide a flexible frame-
work that is able to incorporate different scheduling regimes
transparently to the user without requiring modification to
the thread controller itself. To this end, all TPMs must
conform to the same interface although no constraints are
imposed on the implementations themselves. The interface
set forth below provides operations for choosing a new
thread to run, enqueuing an evaluating thread, setting thread
priorities, and migrating threads. These procedures are
expected to be used exclusively by the TC; in general, user
applications need not be aware of the thread policy manager/
thread controller interface.

(tpm-get-next-thread vp) returns the next ready thread to run
on vp.
(tpm-enqueue-ready-thread vp obj) enqueues obj, which may
be either a thread or a TCB, into the ready queue of the
TPM associated with vp.
(tpm-priority priority) and (tpm-quantum quantum) are
guard procedures that verify that their arguments are a
valid priority or quantum respectively.
(tpm-allocate-vp vp thread) allocates thread on vp; if vp is
false, thread is allocated on a virtual processor determined
by the TPM.
(tpm-vp-idle vp) is called by the thread manager if there are
no evaluating threads on vp. This procedure can migrate
a thread from another virtual processor, do bookkeeping,
or call the physical processor to have the processor switch
itself to another VP.
(tpm-enqueue-suspended-thread vp thread) suspends thread
on vp's suspend queue.

Besides determining a scheduling order for evaluating
threads, the TPM implements two basic load-balancing
decisions: (1) it may choose a VP on which a newly created
thread should be run, and (2) it determines which threads on
its VP can be migrated, and which threads it will choose for
migration from other VPs.

The first decision point is important for handling initial
load-balancing; the second is important for supporting
dynamic load-balancing protocols. Determining the initial
placement of a newly evaluating thread is often based on
priorities different from those used to determine the migra-
tion of currently evaluating threads. The TPM interface
preserves this distinction.

Scheduling policies can be classified along several impor-
tant dimensions:
Locality: Is there a single global queue of threads in this
system, or does each TPM maintain its own local queues?
State: Are threads distinguished based on their current state?
For example, an application might choose an implementa-
tion in which all threads occupy a single queue regardless of
their current state. Alternatively, it might choose to classify
threads into different queues based on whether they are
evaluating, scheduled, previously suspended, etc.
Ordering: Are the queues implemented as FIFOs, LIFOs,
round-robin, priority, or realtime structures (among others)?
Serialization: What kind of locking structure does an appli-
cation impose on various policy manager queues?

Choosing different alternatives in this classification
scheme results in different performance characteristics. For
example, if a granularity structure is adopted that distin-
guishes evaluating threads (i.e., threads with TCBs) from
scheduled ones, and the constraint that only scheduled
threads can be migrated is imposed, then no locks are
required to access the evaluating thread queue; this queue is
local to the VP on which it was created. Queues holding
scheduled threads however must be locked because they are
targets for migration by TPMs on other VPs. This kind of
scheduling regimen is useful if dynamic load-balancing is
not an issue. Thus, when there exist many long-lived non-
blocking threads (of roughly equal duration), most VPs will
be busy most of the time executing threads on their own
local ready queue. Eliminating locks on this queue in such
applications is therefore beneficial. On the other hand,
applications that generate threads of varying duration may
exhibit better performance when used with a TPM that
permits migration of both scheduled and evaluating threads
even if there is an added cost associated with locking the
runnable ready queue.

Global queues imply contention among thread policy
managers whenever they need to execute a new thread, but
such an implementation is useful in implementing many
kinds of parallel algorithms. For example, in master/slave
(or worker farm) programs, the master initially creates a pool
of threads; these threads are long-lived structures that do not
spawn any new threads themselves. Once running on a VP,
they rarely block. Thus, a TPM executing such a thread has
no need to support the overhead of maintaining a local
thread queue. Local queues are useful, however, in imple-
menting result-parallel programs in which the process struc-
ture takes the form of a tree or graph; these queues can be
used in such applications to load balance threads fairly
among a set of virtual processors.

Message-Passing Abstractions
Message-passing is an efficient communication
mechanism on disjoint memory architectures, especially for
parallel applications that are coarse-grained or have known
communication patterns. A message-passing facility is pro-
vided in Sting to minimize the overheads of implementing
shared memory on disjoint memory architectures. First-class
procedures and ports exhibit a synergy in this context.
Sting allows message-passing abstractions to be inte-
grated within a shared-memory environment. A port is a
first-class data object that serves as a receptacle for mes-
sages that may be sent by other threads. Since Sting uses a
shared virtual memory model, any complex data structure
(including closures) can be communicated through a port.
This flexibility permits Sting applications to implement user
level message-passing protocols transparently and to com-
bine the best features of shared memory and message
passing within a unified environment.

Ports are first class data structures. There are two basic
operations provided over ports:

1. (put obj port) copies obj to port. The operation is
asynchronous with respect to the sender.

2. (get port) removes the first message in port, and blocks
if port is empty.

Objects read from a port P are copies of objects written to
P. This copy is a shallow copy, i.e., only the top-level
structure of the object is copied; the substructure is shared.
The ports are designed with copying semantics because they
are designed to be used when shared memory would be
inefficient. While the standard version of put does a shallow
copy, there is also a version available that does a deep copy;
this latter version not only copies the top-level object but
also all its substructures.

For example, sending a closure in a message using
shallow copying involves constructing a copy of the closure
representation, but preserving references to objects bound
within the environment defined by the closure. The choice of
copy mechanisms used clearly is influenced by the under-
lying physical architecture, and the application domain.
There are a range of message transmission implementations
that can be tailored to the particular physical substrate on
which the Sting implementation resides.

Thus, evaluating the expression,

(put (lambda () E) port)

transmits the closure of the procedure (lambda () E) to port.
If a receiver is defined on port thus,

(define (receiver)
  (let ((msg (get port)))
    (fork-thread (msg) (current-vp))
    (receiver)))

the procedural object sent is evaluated on the virtual pro-
cessor of the receiver. By creating a new thread to evaluate
messages, the receiver can accept new requests concurrently
with the processing of old ones.

This style of communication has been referred to as
"active messages" since the action that should be taken upon
message receipt is not encoded as part of the underlying
implementation but determined by the message itself. There
is a great deal of flexibility and simplicity afforded by such
a model since the virtual processor/thread interface does not
require any alteration in order to support message commu-
nication. Two aspects of the Sting design are crucial for
realizing this functionality: (1) the fact that objects reside in
a shared virtual memory allows all objects (including those
containing references to other objects, e.g., closures) to be
transmitted among virtual processors freely, and (2) first-
class procedures permit complex user-defined message han-
dlers to be constructed; these handlers can execute in a
separate thread on any virtual processor. On distributed
memory machines, objects would reside in a distributed
shared virtual memory. To illustrate, in the above example,
E may be a complex query of a database. If a receiver is
instantiated on the processor on which the database resides,
such queries do not involve expensive migration of the
database itself. Communication costs are reduced because
queries are directly copied to the processor on which the
database resides; the database itself does not need to migrate
to processors executing queries. The ability to send proce-
dures to data rather than more traditional RPC-style com-
munication leads to a number of potentially significant
performance and expressivity gains.

First-class procedures and lightweight threads make
active message passing an attractive high-level communi-
cation abstraction. In systems that support active messages
without the benefit of these abstractions, this functionality is
typically realized in terms of low-level support protocols.
First-class procedures make it possible to implement active
messages trivially. An active message is a procedure sent to
a port. First-class ports have obvious and important utility in
distributed computing environments as well and lead to a
simpler and cleaner programming model than traditional
RPC.

Memory Management
Sting uses a shared virtual memory model. Implementa-
tions of Sting on distributed memory platforms must be built
on top of a distributed shared virtual memory substrate.
Thus, the meaning of a reference does not depend on where
the reference is generated, or where the object is physically
located.

Storage Organization
In Sting, there are three storage areas associated with
each TCB 32 (FIG. 5). The first, a stack 31, is used to
allocate objects created by the thread whose lifetime does
not exceed the dynamic extent of its creator. More precisely,
objects allocated on a stack may only refer to other objects
that are allocated in a current (or earlier) stack frame, or
which are allocated on some heap. Stack allocated objects
can refer to objects in heaps because the thread associated
with the stack is suspended while the heap 33 is garbage
collected; references contained in stacks are part of the root
set traced by the garbage collector.

Thread private or local heaps 33 are used to allocate
non-shared objects whose lifetimes might exceed the life-
time of the procedures that created them. The term "might
exceed" is used because it is not always possible for the
compiler to determine the lifetime of an object in program-
ming languages such as Scheme or ML. Furthermore, it may
not be possible to determine the lifetimes of objects in
languages which allow calls to unknown procedures. Ref-
erences contained in a private heap can refer to other objects
in the same private heap, or objects in shared or global heaps
35, but they cannot refer to objects in the stack 31. Refer-
ences in the stack may refer to objects in the private heap,
but references in the shared heap may not. Private heaps lead
to greater locality since data allocated on them are used
exclusively by a single thread of control; the absence of
interleaving allocations among multiple threads means that
objects close together in the heap are likely to be logically
related to one another.

No other thread can access objects that are contained in a
thread's stack or local heap. Thus, both thread stacks and
local heaps can be implemented in local memory on the
processor without any concern for synchronization or
memory coherency. Thread local heaps are actually a series
of heaps organized in a generational manner. Storage allo-
cation is always done in the youngest generation in a manner
similar to other generational collectors. As objects age they
are moved to older generations. All garbage collection of the
local heap is done by the thread itself. In most thread
systems that support garbage collection all threads in the
system must be suspended during a garbage collection. In
contrast, Sting's threads garbage collect their local heaps
independently and asynchronously with respect to other
threads. Thus other threads can continue their computation
while any particular thread collects its local heap; this leads
to better load balancing and higher throughput. A second
advantage of this garbage collection strategy is that the cost
of garbage collecting a local heap is charged only to the
thread that allocates the storage, rather than to all threads in
the system.

Sting provides "thread groups" as a means of gaining
control over a related collection of threads. Every thread is
associated with some thread group. A child thread is in the
same group as its parent unless it is created as part of a new
group. Thread groups provide operations analogous to ordi-
nary thread operations (e.g., termination, suspension, etc.) as
well as operations for debugging and monitoring (e.g.,
listing all threads in a given group, listing all groups,
profiling, genealogy information, etc.). In addition, a thread
group also includes a "shared heap" accessible to all its
members.

A shared heap or global heap 35 of a thread group is
allocated when the thread group is created. The shared heap,
like the local heap, is actually a series of heaps organized in
a generational manner. References in shared heaps may only
refer to objects in shared heaps. This is because any
object that is referenced from a shared object is also a shared
object and, therefore, must reside in a shared heap. This
constraint on shared heaps is enforced by ensuring that
references stored in shared heaps refer to objects that are (a)
either in a shared heap, or (b) allocated in a local heap and
garbage collected into a shared one. That is, the graphs of
objects reachable from the referenced object must be copied
into or located in the shared heap. The overheads of this
memory model depend on how frequently references to
objects allocated on local heaps escape. Experience has
shown that in implementing fine-grained parallel programs,
most objects allocated on a local heap remain local to the
associated thread, and are not shared. Those objects that are
shared among threads often are easily detected either via
language abstractions or by compile-time analysis.

To summarize, the reference discipline observed among
the storage areas associated with a thread is as follows: (1)
references in the stack refer to objects in its current or
previous stack frame, its local heap, or its shared heap; (2)
references in the local heap refer to objects on that heap or
to objects allocated on some shared heap; and (3) references
in the shared heap refer to objects allocated on its shared
heap (or some other shared heap).
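As a hedged illustration, the three rules above can be written down as a small table of permitted stores. The sketch is in Python for convenience, the names STACK, LOCAL_HEAP, SHARED_HEAP, ALLOWED_TARGETS and store_allowed are hypothetical, and it deliberately ignores two refinements stated in the text: the ordering of stack frames and the fact that stacks and local heaps belong to one particular thread.

```python
STACK, LOCAL_HEAP, SHARED_HEAP = "stack", "local-heap", "shared-heap"

# Area that holds the reference -> areas the reference may point into.
ALLOWED_TARGETS = {
    STACK: {STACK, LOCAL_HEAP, SHARED_HEAP},   # rule (1)
    LOCAL_HEAP: {LOCAL_HEAP, SHARED_HEAP},     # rule (2)
    SHARED_HEAP: {SHARED_HEAP},                # rule (3)
}


def store_allowed(source_area, target_area):
    """Return True if a reference stored in source_area may name an
    object living in target_area under the discipline above."""
    return target_area in ALLOWED_TARGETS[source_area]
```

A store for which store_allowed returns False, such as a shared-heap object being made to point at a local-heap object, corresponds to the escaping case discussed earlier, where the referenced object graph has to be moved into a shared heap first.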

Like local heaps, global heaps are organized in a genera-
tional manner, but garbage collection of global heaps is more
complicated than that for local heaps because many different
threads can simultaneously access objects contained in the
global heap. Note that as a result, global heap allocation
requires locking the heap.

In order to garbage collect a global heap, all threads in the
associated thread group (and its inferiors) are suspended.
This is because any of these threads can access data in the
global heap. However, other threads in the system, i.e., those
not inferior to the group associated with the heap being
collected, continue execution independent of the garbage
collection.

Each global heap has a set of incoming references asso-
ciated with it. These sets are maintained by checking for
stores of references that cross area boundaries. After the
threads associated with the global heap have been
suspended, the garbage collector uses the set of incoming
references as the roots for the garbage collection. Any
objects reachable from the incoming reference set are copied
to the new heap. When the garbage collection is complete
the threads associated with the global heap are resumed.
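The copying step described above can be sketched, under simplifying assumptions, as a reachability walk that starts from the incoming reference set. The Python below is illustrative only: the function name collect and the representation of a heap as a dictionary from object names to the names they reference are assumptions, and suspension and resumption of the thread group are not modeled.

```python
def collect(old_heap, incoming_refs):
    """Copy every object reachable from incoming_refs into a fresh heap.

    old_heap maps an object name to the list of names it references.
    Names that are not keys of old_heap are treated as objects living
    in some other heap and are not followed.
    """
    new_heap = {}
    worklist = [name for name in incoming_refs if name in old_heap]
    while worklist:
        name = worklist.pop()
        if name in new_heap:
            continue  # already copied
        new_heap[name] = list(old_heap[name])
        worklist.extend(c for c in old_heap[name] if c in old_heap)
    return new_heap
```

Objects that cannot be reached from any incoming reference, including self-referential cycles, are simply never copied and are thereby reclaimed.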
Abstract Physical Machines and Abstract Physical Processors

The operating system's lowest-level abstraction is a 
micro-kernel called the Abstract Physical Machine (APM). 

The APM plays three important roles in the Sting software 
architecture: 

1. It provides a secure and efficient foundation for sup- 
porting multiple virtual machines. 

2. It isolates all other components in the system from 
hardware dependent features and idiosyncrasies. 

3. It controls access to the physical hardware of the 
system. 

The APM is implemented within a special virtual machine
called the root virtual machine. This machine has access to
all facilities available in any other virtual machine including
a virtual address space, virtual processors, and threads. In
addition, the root virtual machine has access to abstract
physical processors, device drivers, and a virtual memory
manager. The fact that the abstract physical machine is
organized in terms of virtual machines leads to some impor-
tant expressivity gains. There are no heavyweight threads.
All threads are lightweight. There are no kernel threads or
stacks for implementing system calls. All system calls are
handled using the execution context of the thread making the
system call. This is possible because Scheme is a safe
language (i.e., dangling pointers, free coercion between
address and data, etc., are not possible), and portions of the
APM are mapped into every virtual machine in the system.
Asynchronous programming constructs available to user
threads are also available to threads found in the APM.
APM-related threads can be controlled in the same manner
as any other thread in a virtual machine. Threads which
block executing a kernel operation inform their virtual
processor of this fact. The VP is then free to execute some
other thread. This is true for both inter-thread communica-
tion and I/O; Sting's treatment of non-blocking kernel calls
provides the same capability as, e.g., scheduler activations
or Psyche's virtual processor abstraction.

Virtual machines are created and destroyed by the APM. 
Creating a new virtual machine entails the following: 

1. Creating a new virtual address space, 

2. Mapping the APM kernel into this address space, 

3. Creating a root virtual processor in this virtual machine, 

4. Allocating abstract physical processors to this machine, 

5. Scheduling the root virtual processor to run on an
abstract physical processor.

Destroying a virtual machine entails generating signals to
terminate all running threads on that machine, closing any
devices opened by threads executing in the machine, and
finally deallocating the virtual address space associated with
this machine.

Each processor abstraction 12 is composed of a virtual
processor controller (VPC) 13 and virtual processor policy
manager (VPPM) 15. The relationship between the VP
controller and the VP policy manager is similar to that
between the thread controller and the thread policy manager,
i.e., the VP controller is a client of the VP policy manager.
Whenever the VP controller needs to make a policy decision
it calls the VP policy manager to make that decision.

While all physical processors run the same VP controller,
they can run different VP policy managers. This allows a
multiprocessor system to customize the system's use of each
physical processor. It is also possible for the system to run
the same VP policy manager on each of the physical
processors.

When a virtual machine wishes to schedule a virtual
processor on an abstract physical processor it calls the
virtual processor controller on that physical processor.
Likewise, when a virtual machine wishes to remove a virtual
processor from an abstract physical processor it calls the
virtual processor controller on that physical processor. Each
VP controller manages the virtual processors which are
mapped onto its physical processor, including all virtual
processor state changes.

The VP policy manager makes all policy decisions relat-
ing to the scheduling and migration of virtual processors on
physical processors. There are three types of decisions. First,
it determines the VP to PP map. The mapping takes place at
two distinct times: when the VP is run for the first time and
when a VP which has been blocked is rerun. Second, the
policy manager also determines the order in and duration for
which VPs on a PP are run. Finally, the VP policy manager
decides when a VP should be moved (migrated) from one
processor to another.

These three decisions allow the VP policy manager to
balance the work load on a machine and determine the
fairness properties of the physical machine with respect to
virtual machines. They also allow VP policy managers to
decide where to move the VPs of a fault tolerant VM when
a physical processor fails.

Like the thread policy manager, the VP policy manager
presents a well-defined interface to the VP controller. The
data structures which the VP policy manager uses to make
its decisions are completely private to it. These data
structures may be local to a particular VP policy manager or
shared among the various instances of the VP policy
manager, or some combination thereof, but no other
component of the system has access to them. The VP policy
manager can be customized to provide different behaviors to
different instances of Sting. This functionality allows it to be
customized for different operating system environments as
diverse as real time, interactive, or computationally
intensive systems.

Finally, while the thread policy manager is concerned
with load balancing and fairness among threads, the virtual
processor policy manager is concerned with load balancing
and fairness among virtual machines and virtual processors.

Each physical processor in an APM includes a virtual
processor controller (VPC), and a virtual processor policy
manager (VPPM). In this sense, physical processors are
structurally identical to virtual processors. The VPC effects
state changes on virtual processors. Like threads, virtual
processors may be running, ready, blocked or terminating. A
running VP is currently executing on a physical processor;
a ready VP is capable of running, but is currently not. A
blocked VP is executing a thread waiting on some external
event (e.g., I/O). The VPPM is responsible for scheduling
VPs on a physical processor; its scheduling policies are
similar to those used by a TPM. The VPPM presents a
well-defined interface to the VP controller; different Sting
systems can contain different VP policy managers.

Exception Handling
Synchronous exceptions and interrupts are handled uni-
formly in Sting. Associated with every exception is a handler
responsible for performing a set of actions to deal with the
exception. Handlers are procedures that execute within a
thread. An exception raised on processor P executes using
the context of P's current thread. There are no special
exception stacks in the Sting micro-kernel.

When an exception (e.g., invalid instruction, memory
protection violation, etc.) is raised on processor P, P's
current continuation (i.e., program counter, heap frontier,
stack, etc.) is first saved. The exception dispatcher then
proceeds to find the target of the exception, interrupting it if
the thread is running, and pushing the continuation of the
handler and its arguments onto the target thread's stack.
Next, the dispatcher may choose to (a) resume the current
thread by simply returning into it, (b) resume the target
thread, or (c) call the thread controller to resume some other
thread on this processor. When the target thread is resumed,
it will execute the continuation found on the top of its stack;
this is the continuation of the exception handler.

The implementation of exceptions in Sting is novel in
several respects:

1. Handling an exception simply involves calling it since
it is a procedure.

2. Exceptions are handled in the execution context of the
thread receiving it.

3. Exceptions are dispatched in the context of the current
thread.

4. Exceptions once dispatched become the current con-
tinuation of the target thread and are executed auto-
matically when the thread is resumed.

5. An exception is handled only when the target thread is
resumed.

6. Exception handling code is written in Scheme and
manipulates continuations and procedures to achieve
desired effect.

Sting is able to provide this model of exceptions because
first-class procedures and threads, manifest continuations,
dynamic storage allocation, and a uniform addressing
mechanism are all central features of its design.

The target of a synchronous exception is always the
current thread. Asynchronous exceptions or interrupts are
treated slightly differently. Since interrupts can be directed at
any thread (not just the currently executing one), handling
such exceptions requires the handler to either process the
exception immediately, interrupt the currently running
thread to handle the exception, or create a new handler
thread. Since interrupt handlers are also Scheme procedures,
establishing a thread to execute the handler or using a
current thread for that purpose merely involves setting the
current continuation of the appropriate thread to call the
handler. The pseudo code for a Sting exception dispatcher is:

 1: (define (exception-dispatcher type . args)
 2:   (save-current-continuation)
 3:   (let ((target handler (get-target&handler type args)))
 4:     (cond ((eq? target (current-thread))
 5:            (apply handler args))
 6:           (else
 7:            (signal target handler args)
 8:            (case ((exception-priority type))
 9:              ((continue) (return))
10:              ((immediate) (switch-to-thread target))
11:              ((reschedule) (yield-processor)))))))



In line 2, the current continuation is saved on the stack of the
current thread. The continuation can be saved on the stack
because it cannot escape and it will only be called once. On
line 3 the dispatcher finds the thread for which the exception
is intended and the handler for the exception type. Line 4
checks to see if the target of the exception is the current
thread and if so does not push the exception continuation
(line 5). Rather, the dispatcher simply applies the handler to
its arguments. This is valid since the dispatcher is already
running in the context of the exception target, i.e., the current
thread. If the target of the exception is not the current thread,
the dispatcher sends the exception to the target thread (line
7). Sending a thread a signal is equivalent to interrupting the
thread and pushing a continuation containing the signal
handler and its arguments onto the thread's stack, and
resuming the thread, which causes the signal handler to be
executed. After signaling the target thread, the handler
decides which thread to run next on the processor (line 8).
It may be itself (line 9), the target thread (line 10), or the
thread with the highest priority (line 11).
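The control flow just described can be modeled, purely as an illustration, in a few lines of Python. The Thread class, the dispatch function and the use of None to mean "let the thread controller pick some other thread" are assumptions made for this sketch; real continuations, interrupts and priorities are not modeled, and a handler is just a Python callable.

```python
class Thread:
    """Toy thread: a stack of pending (handler, args) continuations."""

    def __init__(self, name):
        self.name = name
        self.stack = []  # newest pending handler is last
        self.log = []    # results of handlers this thread has run

    def resume(self):
        # On resumption the thread first runs what was pushed on its stack.
        while self.stack:
            handler, args = self.stack.pop()
            self.log.append(handler(*args))


def dispatch(current, target, handler, args, priority):
    """Return the thread that runs next, or None to defer that choice."""
    if target is current:
        # Lines 4-5: already in the target's context, so apply directly.
        current.log.append(handler(*args))
        return current
    # Line 7: signal the target by pushing the handler onto its stack.
    target.stack.append((handler, args))
    if priority == "continue":    # line 9: keep running the current thread
        return current
    if priority == "immediate":   # line 10: switch to the target now
        target.resume()
        return target
    return None                   # line 11: yield the processor
```

Note that under the "continue" priority the handler stays pending on the target's stack until that thread is next resumed, matching the statement that an exception is handled only when its target thread is resumed.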

There is one other important distinction between Sting's
exception handling facilities and those found in other oper-
ating systems. Since threads that handle exceptions are no
different from other user-level threads in the system (e.g.,
they have their own stack and heap), and since exception
handlers are ordinary first-class procedures, handlers are
free to allocate storage dynamically. Data generated by a
handler will be reclaimed by a garbage collector in the same
way that any other datum is recovered. The uniformity
between the exception handling mechanism and higher-level
Sting abstractions allows device driver implementors
expressivity and efficiency not otherwise available in paral-
lel languages or operating systems.

Concurrency Paradigms 

Having provided a detailed description of the software
architecture, several diverse concurrency paradigms will now
be expressed and implemented with the present software
architecture.

In a result parallel program, each concurrently executing 
process contributes to the value of a complex data structure 
(e.g., an array or list), or is a member of a complex process 
graph. Process communication is via this result structure or 
graph. Expressions that attempt to access a component of the 
result whose contributing process is still evaluating block 
until the process completes. 

Futures are a good example of an operation well-suited
for implementing result parallel algorithms. The MultiLisp
or Mul-T expression (future E) creates a thread responsible
for computing E; the object returned is known as a future.
When E finishes, yielding v as its result, the future is said
to be determined. An expression that touches a future either
blocks if E is still being computed or yields v if the future
is determined.
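As a minimal sketch of these semantics, assuming ordinary operating system threads in place of Sting's lightweight threads, a future can be modeled in Python as follows. The class name Future and the method name touch are chosen to mirror the text and are not taken from MultiLisp, Mul-T or any Python library; error handling is deliberately minimal.

```python
import threading


class Future:
    """Start computing thunk() in its own thread; touch() waits for it."""

    def __init__(self, thunk):
        self._determined = threading.Event()
        self._value = None
        self._thread = threading.Thread(target=self._run, args=(thunk,))
        self._thread.start()

    def _run(self, thunk):
        try:
            self._value = thunk()
        finally:
            # Mark the future determined even if thunk raised, so that
            # touch() cannot hang; the value is then left as None.
            self._determined.set()

    def touch(self):
        self._determined.wait()  # blocks while E is still being computed
        return self._value
```

For example, Future(lambda: 6 * 7).touch() yields 42 once the spawned thread has finished. Every future in this model costs a full thread, which is the behavior whose cost is examined next.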

In a naive implementation of the sorting program given in
FIG. 11 each instantiation of a future will entail the creation
of a new thread. This behavior is undesirable because a
future computing at level i in the process tree has a manifest
data dependence with its children at level i+1 and so on.
Poor processor and storage utilization will result, given the
data dependencies found in this program. This is because
many of the lightweight processes that are created will
either need to block when they request the value of other as
yet unevaluated futures or, in the case of processes
computing small primes, for example, do a small amount of
computation relative to the cost incurred in creating them.
Because the dynamic state of a thread consists of large
objects (e.g., stacks and heaps), cache and page locality is
compromised if process blocking occurs frequently or if
process granularity is too small.

The semantics of touch and future dictate that a future F
which touches another future G must block on G if G is not
yet determined. Assume T_F and T_G are the thread represen-
tations of F and G, respectively. The runtime dynamics of the
touch operation on G can entail accessing T_G either when T_G
is (a) delayed or scheduled, (b) evaluating, or (c) deter-
mined. In the latter case, no synchronization between these
threads is necessary. Case (b) requires T_F to block until T_G
completes. Sting performs an important optimization for
case (a), however, which is discussed below.

T_F can evaluate the closure encapsulated within T_G (call
it E) using its own stack and heap, rather than blocking and
forcing a context switch. In effect, this implementation treats
E as an ordinary procedure, and the touch of G as a simple
procedure call; it is said that T_F absorbs T_G in this case. The
correctness of this optimization lies in the observation that
T_F would necessarily block otherwise; by applying E using
T_F's dynamic context, the VP on which T_F executes does not
incur the overhead of executing a context switch. In
addition, no TCB need be allocated for T_G since T_F's TCB
is used instead.
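
The three cases of the touch operation, including absorption in case (a), may be sketched as follows. This is a minimal Python model under stated assumptions, not Sting's implementation: the class LazyThread, its state constants and the absorbed flag are hypothetical names introduced for illustration, and a lock stands in for whatever atomic state transition the thread controller uses.

```python
import threading

DELAYED, SCHEDULED, EVALUATING, DETERMINED = range(4)


class LazyThread:
    """Hypothetical model of a thread whose thunk a toucher may absorb."""

    def __init__(self, thunk):
        self.thunk = thunk
        self.state = SCHEDULED
        self.value = None
        self.absorbed = False
        self._lock = threading.Lock()
        self._done = threading.Event()

    def _claim(self):
        # Atomically move from delayed/scheduled to evaluating; one caller wins.
        with self._lock:
            if self.state in (DELAYED, SCHEDULED):
                self.state = EVALUATING
                return True
            return False

    def _evaluate(self):
        self.value = self.thunk()
        with self._lock:
            self.state = DETERMINED
        self._done.set()

    def run(self):
        """What a processor would do on dequeuing this thread."""
        if self._claim():
            self._evaluate()

    def touch(self):
        if self._claim():
            # Case (a): absorb, i.e. apply the thunk on the toucher's own stack.
            self.absorbed = True
            self._evaluate()
        else:
            # Case (b) blocks until completion; case (c) returns at once.
            self._done.wait()
        return self.value
```

In this model a touch of a still-scheduled thread becomes an ordinary procedure call in the toucher, so no new execution context is created for the touched thread.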

This optimization may only lead to observably different
results if used in instances where the calling thread need not
necessarily block. For example, suppose T_G was an element
of a speculative call by T_F. Furthermore, assume T_G
diverges, but another speculative thread (call it T_H) does not.
In the absence of absorption, both T_G and T_H would spawn
separate thread contexts. In the presence of absorption,
however, T_F may absorb T_G and thus will also loop because
T_G does. Users can parameterize thread state to inform the
TC if a thread can absorb or not; Sting provides interface
procedures for this purpose.

Because of absorption, Sting reduces the overhead of
context switching, and increases process granularity for
programs in which processes exhibit strong data dependen-
cies among one another. Of course, for the operation to be
most effective, thread granularity must be sufficiently large
to permit scheduled threads to become absorbed; if process
granularity is too small, processors will begin evaluation of
threads that may potentially be absorbed before the absorb-
ing threads can demand their values.

Load-based inlining and lazy task creation are two other
similar optimizations that have been applied in other parallel
Lisp systems. Load-based inlining causes a thread to be
inlined (i.e., absorbed) if the current system load exceeds
some specified threshold. Not only does this optimization
require programmer involvement, under certain conditions it
may induce deadlock or starvation for programs which
would otherwise terminate. This is because the inlining
decision is irrevocable; thus, it imposes a specific evaluation
order on tasks whose data dependencies might require a
different evaluation order. Thread absorption does not suffer
from this problem since absorption occurs only when a
thread would otherwise block, and only when data depen-
dencies warrant.

Lazy task creation solves many of the problems associ-
ated with load-based inlining: it always inlines the evalu-
ation of every thread, but permits this inlining operation to
be revocable if processors become idle. Threads are never
created unless actually needed. This scheme requires no
programmer intervention, does not induce deadlocks in
programs which would otherwise not exhibit them, and
reduces the number of tasks actually generated.

Thread absorption differs from lazy tasks in two major
respects: (1) Thread absorption works even in the presence
of scheduling protocols determined by the application; lazy
task creation assumes a global LIFO schedule, and the
presence of a single queue to hold inlined threads. (2) Lazy
task creation uses one global heap per processor. This
implementation results in less locality when a task is stolen
than occurs with thread absorption. In addition, garbage col-
lection in the presence of lazy task creation requires all
threads in the system to be stopped (even though the
collector itself may be parallel). This constraint does not
apply to thread absorption.

Another example is the master-slave paradigm, which is a
popular parallel program structuring technique. In this
approach, the collection of processes generated is bounded
a priori; a master process generates a number of worker
processes and combines their results. Process communica-
tion typically occurs via shared concurrent data structures or
variables. Master-slave programs often are more efficient
than result parallel ones on stock multiprocessor platforms
because workers rarely need to communicate with one
another except to publish their results, and process granu-
larity can be better tailored for performance.

Sting has been used to build an optimizing implementa-
tion of first-class tuple-spaces in Scheme. A tuple-space is an
object that serves as an abstraction of a synchronizing
content-addressable memory; tuple-spaces are a natural
implementation choice for many master/slave-based algo-
rithms.

Since tuples are objects and tuple-operations are binding
expressions, not statements, the presence of first-class deno-
table tuple-spaces results in added modularity and expres-
sivity. In the preferred implementation, tuple-spaces can be
specialized as synchronized vectors, queues, streams, sets,
shared variables, semaphores, or bags; the operations per-
mitted on tuple-spaces remain invariant over their represen-
tation. In addition, applications can specify an inheritance
hierarchy among tuple-spaces if so desired.

Processes can read, remove or deposit new tuples into a
tuple-space. The tuple argument in a read or remove opera-
tion is called a "template" and may contain variables pre-
fixed with a "?". Such variables are referred to as "formals"
and acquire a binding-value as a consequence of the match
operation. The bindings acquired by these formals are used
in the evaluation of a subordinate expression: thus, it is
possible to write:

(get TS [?x]
  (put TS [(+ x 1)]))

to remove atomically a singleton tuple from TS, increment
it by one, and deposit it back into TS.
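
The template matching and binding behavior described above may be sketched as follows. This is a minimal Python illustration under stated assumptions, not the optimizing implementation referred to in the text: TupleSpace, Formal and the method names put, get and rd are hypothetical, formals are modeled as objects rather than "?"-prefixed variables, and the subordinate expression is passed as a function whose keyword parameters receive the bindings.

```python
import threading


class Formal:
    """A template variable such as ?x; it matches any field and binds it."""

    def __init__(self, name):
        self.name = name


class TupleSpace:
    def __init__(self):
        self._tuples = []
        self._cond = threading.Condition()

    def put(self, fields):
        with self._cond:
            self._tuples.append(tuple(fields))
            self._cond.notify_all()  # wake any process blocked in get or rd

    @staticmethod
    def _match(template, tup):
        if len(template) != len(tup):
            return None
        bindings = {}
        for pattern, field in zip(template, tup):
            if isinstance(pattern, Formal):
                bindings[pattern.name] = field
            elif pattern != field:
                return None
        return bindings

    def _scan(self, template, remove):
        for index, tup in enumerate(self._tuples):
            bindings = self._match(template, tup)
            if bindings is not None:
                if remove:
                    del self._tuples[index]
                return bindings
        return None

    def _wait_for(self, template, remove):
        # Block until some tuple matches; removal happens under the lock.
        with self._cond:
            while True:
                bindings = self._scan(template, remove)
                if bindings is not None:
                    return bindings
                self._cond.wait()

    def get(self, template, body):
        return body(**self._wait_for(template, remove=True))

    def rd(self, template, body):
        return body(**self._wait_for(template, remove=False))


# Counterpart of (get TS [?x] (put TS [(+ x 1)])).
ts = TupleSpace()
ts.put([41])
ts.get([Formal("x")], lambda x: ts.put([x + 1]))
```

Because the matching tuple is removed while the lock is held, no other process can observe the counter between its removal and the deposit of the incremented value.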

The present implementation also takes advantage of
thread absorption to permit the construction of fine-grained
parallel programs that synchronize on tuple-spaces. Threads
are used as bona fide elements in a tuple. Consider a process
P that executes the following expression:

(rd TS [x1 x2] E)

where x1 and x2 are non-formals. Assume furthermore that
a tuple in TS is deposited as a consequence of the operation:
(spawn TS [E_1 E_2]). This operation schedules two threads (call
them T_E1 and T_E2) responsible for computing E_1 and E_2. If
both T_E1 and T_E2 complete, the resulting tuple contains two
determined threads; the matching procedure applies thread-
value when it encounters a thread in a tuple; this operation
retrieves the thread's value.

If T_E1 is still scheduled at the time P executes, however,
P is free to absorb it, and then determine if its result matches
x1. If a match does not exist, P may proceed to search for
another tuple, leaving T_E2 potentially in a scheduled state.
Another process may subsequently examine this same tuple
and absorb T_E2 if warranted. Similarly, if T_E1's result
matches x1, P is then free to absorb T_E2. If either T_E1 or T_E2
is already evaluating, P may choose to either block on one
(or both) thread(s), or examine other potentially matching
tuples in TS. The semantics of tuple-spaces imposes no
constraints on the implementation in this regard.

Sting's combination of first-class threads and thread
absorption allows the writing of quasi-demand driven fine-
grained (result) parallel programs using shared data struc-
tures. In this sense, the thread system attempts to minimize
any significant distinction between structure-based (e.g.,
tuple-space) and data-flow style (e.g., future/touch) synchro-
nization.

Speculative parallelism is an important programming
technique that often cannot be effectively utilized because of
runtime overheads incurred in its implementation. The two
features most often associated with systems that support a
speculative programming model are the ability to favor
certain more promising tasks over others, and the means to
abort, reclaim (and possibly undo) unnecessary computa-
tion.

Sting permits programmers to write speculative applica- 
tions by: 

1. allowing users to explicitly program thread priorities, 

2. permitting a thread to wait on the completion of other 
threads, and 

3. allowing threads to terminate other threads. 

Promising tasks can execute before unlikely ones because
priorities are programmable. A task a that completes first in
a set of tasks can awaken any thread blocked on its comple-
tion; this functionality permits Sting to support a useful form
of OR-parallelism. Task a can terminate all other tasks in its
task set once it has been determined that their results are
unnecessary. Speculative computation using Sting, however,
will not be able to undo non-local side-effects induced by
useless tasks; the system does not provide a primitive
backtracking mechanism.

Consider the implementation of a wait-for-one construct.
This operator evaluates its list of arguments concurrently,
returning the value yielded by the first of its arguments to
complete. Thus, if a_i yields v in the expression (wait-for-
one a_1 a_2 ... a_i ... a_n), the expression returns v and, if
desired by the programmer, terminates the evaluation of all
the remaining a_j, j ≠ i.

The specification of a wait-for-all construct that imple-
ments an AND-parallel operation is similar; it also evalu-
ates its arguments concurrently, but returns true only when
all its arguments complete. Thus the expression (wait-for-
all a_1 a_2 ... a_i ... a_n) acts as a barrier synchronization point
since the thread executing this expression is blocked until all
the a_i are complete. The implementation of this operation is
very similar to the implementation of the speculative wait-
for-one operation.
The TC implements these operations using a common
procedure, block-on-set. Threads and TCBs are defined to
support this functionality. For example, associated with a
TCB structure is information on the number of threads in the
group that must complete before the TCB's associated
thread can resume.

Block-on-set takes a list of threads and a count. These
threads correspond to the arguments of the wait-for-one and
wait-for-all operations described above; the count argument
represents the number of threads that must complete before
the current thread (i.e., the thread executing block-on-set) is
allowed to resume. If the count is one, the result is an
implementation of wait-for-one; if the count is equal to n, the
result is an implementation of wait-for-all.

The relationship between a thread T_g in the set and the
current thread (T_w) that is to wait on T_g is maintained in a data
structure (called a thread barrier (TB)) that contains refer-
ences to:

1. T_w's TCB

2. the TB of another waiter blocked on T_g (if one exists).

A program defining block-on-set is shown in FIG. 12.
The call:
causes the current thread (call it T) to unblock upon the
completion of m of the T_i, m ≤ n. Each of these T_i has a
reference to T in its chain of waiters.

Applications use block-on-set in conjunction with a
wakeup-waiters procedure that is invoked by the a_i when
they complete. Wakeup-waiters examines the list of waiters
chained from the waiters slot in its thread argument. A waiter
whose wait-count becomes zero is enqueued on the ready
queue of some VP. The TC invokes wakeup-waiters when-
ever a thread T completes (e.g., whenever it terminates or
abnormally exits). All threads waiting on T's completion are
thus rescheduled.
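
The interplay between block-on-set and wakeup-waiters may be sketched as follows. This Python sketch is an assumption-laden model and is not the program of FIG. 12: Task, ThreadBarrier, block_on_set and wakeup_waiters are hypothetical names, the wait count is kept in the barrier object rather than in a TCB, and a waiter resumes by having an event set rather than by being enqueued on the ready queue of a VP.

```python
import threading


class Task:
    """Hypothetical stand-in for a thread with a chain of waiters."""

    def __init__(self, thunk):
        self.thunk = thunk
        self.value = None
        self.done = False
        self.waiters = []  # barriers of threads blocked on this task
        self.lock = threading.Lock()

    def start(self):
        threading.Thread(target=self._run).start()

    def _run(self):
        self.value = self.thunk()
        wakeup_waiters(self)  # invoked whenever the task completes


class ThreadBarrier:
    """Holds the wait count that the text associates with the waiter's TCB."""

    def __init__(self, count):
        self.count = count
        self.lock = threading.Lock()
        self.resume = threading.Event()
        if count <= 0:
            self.resume.set()

    def arrive(self):
        with self.lock:
            self.count -= 1
            if self.count <= 0:
                self.resume.set()  # the waiter may now be rescheduled


def wakeup_waiters(task):
    with task.lock:
        task.done = True
        waiters, task.waiters = task.waiters, []
    for barrier in waiters:
        barrier.arrive()


def block_on_set(count, tasks):
    """Block the caller until `count` of the given tasks have completed."""
    barrier = ThreadBarrier(count)
    for task in tasks:
        with task.lock:
            already_done = task.done
            if not already_done:
                task.waiters.append(barrier)
        if already_done:
            barrier.arrive()
    barrier.resume.wait()
```

With a count of one the caller resumes as soon as any task in the set completes; with a count equal to the number of tasks it resumes only after all of them have completed.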

Given these two procedures, wait-for-one can be defined
simply:

(define (wait-for-one . block-group)
  (block-on-group 1 block-group)
  (map thread-terminate block-group))
If T executes wait-for-one, it blocks on all the threads in its
block-group argument. When T is resumed, it is placed on a
queue of ready threads in the TPM of some available virtual
processor. The map procedure executed upon T's resump-
tion terminates all threads in its group.

Sting's wait-for-all procedure can omit this operation
since all threads in its block-group are guaranteed to have
completed before the thread executing this operation is
resumed.
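
For comparison, the observable behavior of wait-for-one and wait-for-all can be approximated with the Python standard library. This is an illustrative sketch under stated assumptions and does not reflect how the TC implements these operations: the function names are hypothetical, and because Python threads cannot be terminated from outside, each argument is assumed to be a function that accepts a stop event and returns promptly once that event is set.

```python
import threading
from concurrent.futures import (
    ALL_COMPLETED,
    FIRST_COMPLETED,
    ThreadPoolExecutor,
    wait,
)


def wait_for_one(*thunks):
    """Return the first result to become available; ask the rest to stop."""
    stop = threading.Event()
    with ThreadPoolExecutor(max_workers=len(thunks)) as pool:
        futures = [pool.submit(thunk, stop) for thunk in thunks]
        done, _ = wait(futures, return_when=FIRST_COMPLETED)
        # Cooperative stand-in for (map thread-terminate block-group).
        stop.set()
        return next(iter(done)).result()


def wait_for_all(*thunks):
    """Barrier: return True only after every argument has completed."""
    stop = threading.Event()  # never set; nothing needs terminating
    with ThreadPoolExecutor(max_workers=len(thunks)) as pool:
        futures = [pool.submit(thunk, stop) for thunk in thunks]
        wait(futures, return_when=ALL_COMPLETED)
    return True


def quick(stop):
    return "v"


def stubborn(stop):
    stop.wait()  # would never finish unless asked to stop
    return "unused"
```

A call such as wait_for_one(stubborn, quick, stubborn) returns the value produced by quick and then signals the two remaining computations to stop, whereas wait_for_all has no termination step because every argument has already completed when it returns.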

Sting has been implemented on both an 8-processor
Silicon Graphics PowerSeries (MIPS R3000) and a 16-
processor Silicon Graphics Challenge (MIPS R4400). Both
machines are shared-memory (cache-coherent) multiproces-
sors. The abstract physical machine configuration maps
physical processors to lightweight Unix threads; each pro-
cessor in the machine runs one such thread.

While there has been described and illustrated a preferred
embodiment of a computer software architecture, it will be
apparent to those skilled in the art that variations and
modifications are possible without deviating from the broad
principles and spirit of the invention, which shall be limited
solely by the scope of the claims appended hereto.



What is claimed is: 

1. A software architecture for controlling a highly parallel 
computer system comprising: 

abstract physical machines comprising abstract physical
processors forming a microkernel;

virtual machines associated with respective abstract
physical processors, said virtual machines comprising
virtual processors; and

thread groups comprising threads which run on said
virtual processors,

where said virtual processors and said threads are first
class objects.

2. A software architecture as set forth in claim 1, where 
said virtual processors are connected in a virtual topology. 

3. A software architecture as set forth in claim 1, where
said microkernel policy manager is user customizable.

4. A software architecture as set forth in claim 1, where
said virtual processors contain thread policy managers
which are user customizable.

5. A software architecture as set forth in claim 1, where
said threads, said virtual processors and said abstract physi-
cal processors are operatively associated for constructing a
virtual topology.

6. A software architecture as set forth in claim 5, where 
said virtual topology is user customizable. 

7. A software architecture as set forth in claim 1, where
said threads are separable from their respective execution
contexts for allowing delayed allocation of execution con-
texts.

8. A software architecture as set forth in claim 1, further
comprising polymorphic ports.

9. A software architecture as set forth in claim 8, where said is a
class object.

10. A software architecture as set forth in claim 8, where
said threads send messages containing general data and
complex data.

11. A software architecture as set forth in claim 1, where 
said threads garbage collect their respective local stacks and 
heaps independent of other threads. 

12. A software architecture as set forth in claim 1, where
thread groups collect their respective shared heaps indepen-
dently of unrelated thread groups.

13. A software architecture as set forth in claim 1, where 
said virtual processors are multiplexed on said abstract 
physical processors. 

14. A software architecture as set forth in claim 1, where
said virtual processors, said virtual machines and said threads
reside in a persistent memory.

15. A software architecture as set forth in claim 1, where
said abstract physical processors are first class objects.

16. A software architecture as set forth in claim 15, where
said virtual machines are first class objects.

17. A software architecture as set forth in claim 16, where
said abstract physical machines and said thread groups are
first class objects.

18. A computer system comprising: 

a plurality of customizable abstract physical processors 
connected in a customizable physical topology each 
containing a customizable virtual processor controller 
and a customizable virtual processor policy manager; 

a plurality of virtual machines each comprising a virtual
address space whose topology is user specifiable and
further comprising virtual processors which execute
responsive to said virtual processor controller and said
virtual processor policy manager and which contain a
thread controller and a thread policy manager, said
virtual processors being connected in a dynamic virtual
topology and each virtual processor being mapped onto
a respective abstract physical processor which mapping
may be dynamically altered without modification to the
implementation; and

a plurality of threads for running on said virtual proces-
sors responsive to said thread controllers and thread
policy managers.

19. A computer system as set forth in claim 18, where said
virtual processors are multiplexed on said abstract physical
processors.

20. A computer system as set forth in claim 18, further
comprising a persistent memory in which objects, including
said threads, said virtual processors and said virtual
machines, reside.

21. A computer system as set forth in claim 18, where the 
correspondence between threads and virtual processors and 
between virtual processors and physical processors and 
between virtual machines and physical machines are 
dynamically alterable. 